src/ceph/doc/architecture.rst

   1 ==============
   2  Architecture
   3 ==============
   4
   5 :term:`Ceph` uniquely delivers **object, block, and file storage** in one
   6 unified system. Ceph is highly reliable, easy to manage, and free. The power of
   7 Ceph can transform your company's IT infrastructure and your ability to manage
   8 vast amounts of data. Ceph delivers extraordinary scalability–thousands of
   9 clients accessing petabytes to exabytes of data. A :term:`Ceph Node` leverages
  10 commodity hardware and intelligent daemons, and a :term:`Ceph Storage Cluster`
  11 accommodates large numbers of nodes, which communicate with each other to
  12 replicate and redistribute data dynamically.
  13
  14 .. image:: images/stack.png
  15
  16
  17 The Ceph Storage Cluster
  18 ========================
  19
  20 Ceph provides an infinitely scalable :term:`Ceph Storage Cluster` based upon
  21 :abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, which you can read
  22 about in `RADOS - A Scalable, Reliable Storage Service for Petabyte-scale
  23 Storage Clusters`_.
  24
  25 A Ceph Storage Cluster consists of two types of daemons:
  26
  27 - :term:`Ceph Monitor`
  28 - :term:`Ceph OSD Daemon`
  29
  30 .. ditaa::  +---------------+ +---------------+
  31             |      OSDs     | |    Monitors   |
  32             +---------------+ +---------------+
  33
  34 A Ceph Monitor maintains a master copy of the cluster map. A cluster of Ceph
  35 monitors ensures high availability should a monitor daemon fail. Storage cluster
  36 clients retrieve a copy of the cluster map from the Ceph Monitor.
  37
  38 A Ceph OSD Daemon checks its own state and the state of other OSDs and reports
  39 back to monitors.
  40
  41 Storage cluster clients and each :term:`Ceph OSD Daemon` use the CRUSH algorithm
  42 to efficiently compute information about data location, instead of having to
  43 depend on a central lookup table. Ceph's high-level features include providing a
  44 native interface to the Ceph Storage Cluster via ``librados``, and a number of
  45 service interfaces built on top of ``librados``.
  46
  47
  48
  49 Storing Data
  50 ------------
  51
  52 The Ceph Storage Cluster receives data from :term:`Ceph Clients`--whether it
  53 comes through a :term:`Ceph Block Device`, :term:`Ceph Object Storage`, the
  54 :term:`Ceph Filesystem` or a custom implementation you create using
  55 ``librados``--and it stores the data as objects. Each object corresponds to a
  56 file in a filesystem, which is stored on an :term:`Object Storage Device`. Ceph
  57 OSD Daemons handle the read/write operations on the storage disks.
  58
  59 .. ditaa:: /-----\       +-----+       +-----+
  60            | obj |------>| {d} |------>| {s} |
  61            \-----/       +-----+       +-----+
  62
  63             Object         File         Disk
  64
  65 Ceph OSD Daemons store all data as objects in a flat namespace (e.g., no
  66 hierarchy of directories). An object has an identifier, binary data, and
  67 metadata consisting of a set of name/value pairs. The semantics are completely
  68 up to :term:`Ceph Clients`. For example, CephFS uses metadata to store file
  69 attributes such as the file owner, created date, last modified date, and so
  70 forth.
  71
  72
  73 .. ditaa:: /------+------------------------------+----------------\
  74            | ID   | Binary Data                  | Metadata       |
  75            +------+------------------------------+----------------+
  76            | 1234 | 0101010101010100110101010010 | name1 = value1 |
  77            |      | 0101100001010100110101010010 | name2 = value2 |
  78            |      | 0101100001010100110101010010 | nameN = valueN |
  79            \------+------------------------------+----------------/
  80
  81 .. note:: An object ID is unique across the entire cluster, not just the local
  82    filesystem.
  83
  84
  85 .. index:: architecture; high availability, scalability
  86
  87 Scalability and High Availability
  88 ---------------------------------
  89
  90 In traditional architectures, clients talk to a centralized component (e.g., a
  91 gateway, broker, API, facade, etc.), which acts as a single point of entry to a
  92 complex subsystem. This imposes a limit to both performance and scalability,
  93 while introducing a single point of failure (i.e., if the centralized component
  94 goes down, the whole system goes down, too).
  95
  96 Ceph eliminates the centralized gateway to enable clients to interact with
  97 Ceph OSD Daemons directly. Ceph OSD Daemons create object replicas on other
  98 Ceph Nodes to ensure data safety and high availability. Ceph also uses a cluster
  99 of monitors to ensure high availability. To eliminate centralization, Ceph
 100 uses an algorithm called CRUSH.
 101
 102
 103 .. index:: CRUSH; architecture
 104
 105 CRUSH Introduction
 106 ~~~~~~~~~~~~~~~~~~
 107
 108 Ceph Clients and Ceph OSD Daemons both use the :abbr:`CRUSH (Controlled
 109 Replication Under Scalable Hashing)` algorithm to efficiently compute
 110 information about object location, instead of having to depend on a
 111 central lookup table. CRUSH provides a better data management mechanism compared
 112 to older approaches, and enables massive scale by cleanly distributing the work
 113 to all the clients and OSD daemons in the cluster. CRUSH uses intelligent data
 114 replication to ensure resiliency, which is better suited to hyper-scale storage.
 115 The following sections provide additional details on how CRUSH works. For a
 116 detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable, Decentralized
 117 Placement of Replicated Data`_.
 118
 119 .. index:: architecture; cluster map
 120
 121 Cluster Map
 122 ~~~~~~~~~~~
 123
 124 Ceph depends upon Ceph Clients and Ceph OSD Daemons having knowledge of the
 125 cluster topology, which is inclusive of 5 maps collectively referred to as the
 126 "Cluster Map":
 127
 128 #. **The Monitor Map:** Contains the cluster ``fsid``, the position, name
 129    address and port of each monitor. It also indicates the current epoch,
 130    when the map was created, and the last time it changed. To view a monitor
 131    map, execute ``ceph mon dump``.
 132
 133 #. **The OSD Map:** Contains the cluster ``fsid``, when the map was created and
 134    last modified, a list of pools, replica sizes, PG numbers, a list of OSDs
 135    and their status (e.g., ``up``, ``in``). To view an OSD map, execute
 136    ``ceph osd dump``.
 137
 138 #. **The PG Map:** Contains the PG version, its time stamp, the last OSD
 139    map epoch, the full ratios, and details on each placement group such as
 140    the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (e.g.,
 141    ``active + clean``), and data usage statistics for each pool.
 142
 143 #. **The CRUSH Map:** Contains a list of storage devices, the failure domain
 144    hierarchy (e.g., device, host, rack, row, room, etc.), and rules for
 145    traversing the hierarchy when storing data. To view a CRUSH map, execute
 146    ``ceph osd getcrushmap -o {filename}``; then, decompile it by executing
 147    ``crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}``.
 148    You can view the decompiled map in a text editor or with ``cat``.
 149
 150 #. **The MDS Map:** Contains the current MDS map epoch, when the map was
 151    created, and the last time it changed. It also contains the pool for
 152    storing metadata, a list of metadata servers, and which metadata servers
 153    are ``up`` and ``in``. To view an MDS map, execute ``ceph fs dump``.
 154
 155 Each map maintains an iterative history of its operating state changes. Ceph
 156 Monitors maintain a master copy of the cluster map including the cluster
 157 members, state, changes, and the overall health of the Ceph Storage Cluster.
 158
 159 .. index:: high availability; monitor architecture
 160
 161 High Availability Monitors
 162 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 163
 164 Before Ceph Clients can read or write data, they must contact a Ceph Monitor
 165 to obtain the most recent copy of the cluster map. A Ceph Storage Cluster
 166 can operate with a single monitor; however, this introduces a single
 167 point of failure (i.e., if the monitor goes down, Ceph Clients cannot
 168 read or write data).
 169
 170 For added reliability and fault tolerance, Ceph supports a cluster of monitors.
 171 In a cluster of monitors, latency and other faults can cause one or more
 172 monitors to fall behind the current state of the cluster. For this reason, Ceph
 173 must have agreement among various monitor instances regarding the state of the
 174 cluster. Ceph always uses a majority of monitors (e.g., 1, 2:3, 3:5, 4:6, etc.)
 175 and the `Paxos`_ algorithm to establish a consensus among the monitors about the
 176 current state of the cluster.
 177
 178 For details on configuring monitors, see the `Monitor Config Reference`_.
 179
 180 .. index:: architecture; high availability authentication
 181
 182 High Availability Authentication
 183 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 184
 185 To identify users and protect against man-in-the-middle attacks, Ceph provides
 186 its ``cephx`` authentication system to authenticate users and daemons.
 187
 188 .. note:: The ``cephx`` protocol does not address data encryption in transport
 189    (e.g., SSL/TLS) or encryption at rest.
 190
 191 Cephx uses shared secret keys for authentication, meaning both the client and
 192 the monitor cluster have a copy of the client's secret key. The authentication
 193 protocol is such that both parties are able to prove to each other they have a
 194 copy of the key without actually revealing it. This provides mutual
 195 authentication, which means the cluster is sure the user possesses the secret
 196 key, and the user is sure that the cluster has a copy of the secret key.
 197
 198 A key scalability feature of Ceph is to avoid a centralized interface to the
 199 Ceph object store, which means that Ceph clients must be able to interact with
 200 OSDs directly. To protect data, Ceph provides its ``cephx`` authentication
 201 system, which authenticates users operating Ceph clients. The ``cephx`` protocol
 202 operates in a manner with behavior similar to `Kerberos`_.
 203
 204 A user/actor invokes a Ceph client to contact a monitor. Unlike Kerberos, each
 205 monitor can authenticate users and distribute keys, so there is no single point
 206 of failure or bottleneck when using ``cephx``. The monitor returns an
 207 authentication data structure similar to a Kerberos ticket that contains a
 208 session key for use in obtaining Ceph services.  This session key is itself
 209 encrypted with the user's permanent  secret key, so that only the user can
 210 request services from the Ceph Monitor(s). The client then uses the session key
 211 to request its desired services from the monitor, and the monitor provides the
 212 client with a ticket that will authenticate the client to the OSDs that actually
 213 handle data. Ceph Monitors and OSDs share a secret, so the client can use the
 214 ticket provided by the monitor with any OSD or metadata server in the cluster.
 215 Like Kerberos, ``cephx`` tickets expire, so an attacker cannot use an expired
 216 ticket or session key obtained surreptitiously. This form of authentication will
 217 prevent attackers with access to the communications medium from either creating
 218 bogus messages under another user's identity or altering another user's
 219 legitimate messages, as long as the user's secret key is not divulged before it
 220 expires.
 221
 222 To use ``cephx``, an administrator must set up users first. In the following
 223 diagram, the ``client.admin`` user invokes  ``ceph auth get-or-create-key`` from
 224 the command line to generate a username and secret key. Ceph's ``auth``
 225 subsystem generates the username and key, stores a copy with the monitor(s) and
 226 transmits the user's secret back to the ``client.admin`` user. This means that
 227 the client and the monitor share a secret key.
 228
 229 .. note:: The ``client.admin`` user must provide the user ID and
 230    secret key to the user in a secure manner.
 231
 232 .. ditaa:: +---------+     +---------+
 233            | Client  |     | Monitor |
 234            +---------+     +---------+
 235                 |  request to   |
 236                 | create a user |
 237                 |-------------->|----------+ create user
 238                 |               |          | and
 239                 |<--------------|<---------+ store key
 240                 | transmit key  |
 241                 |               |
 242
 243
 244 To authenticate with the monitor, the client passes in the user name to the
 245 monitor, and the monitor generates a session key and encrypts it with the secret
 246 key associated to the user name. Then, the monitor transmits the encrypted
 247 ticket back to the client. The client then decrypts the payload with the shared
 248 secret key to retrieve the session key. The session key identifies the user for
 249 the current session. The client then requests a ticket on behalf of the user
 250 signed by the session key. The monitor generates a ticket, encrypts it with the
 251 user's secret key and transmits it back to the client. The client decrypts the
 252 ticket and uses it to sign requests to OSDs and metadata servers throughout the
 253 cluster.
 254
 255 .. ditaa:: +---------+     +---------+
 256            | Client  |     | Monitor |
 257            +---------+     +---------+
 258                 |  authenticate |
 259                 |-------------->|----------+ generate and
 260                 |               |          | encrypt
 261                 |<--------------|<---------+ session key
 262                 | transmit      |
 263                 | encrypted     |
 264                 | session key   |
 265                 |               |
 266                 |-----+ decrypt |
 267                 |     | session |
 268                 |<----+ key     |
 269                 |               |
 270                 |  req. ticket  |
 271                 |-------------->|----------+ generate and
 272                 |               |          | encrypt
 273                 |<--------------|<---------+ ticket
 274                 | recv. ticket  |
 275                 |               |
 276                 |-----+ decrypt |
 277                 |     | ticket  |
 278                 |<----+         |
 279
 280
 281 The ``cephx`` protocol authenticates ongoing communications between the client
 282 machine and the Ceph servers. Each message sent between a client and server,
 283 subsequent to the initial authentication, is signed using a ticket that the
 284 monitors, OSDs and metadata servers can verify with their shared secret.
 285
 286 .. ditaa:: +---------+     +---------+     +-------+     +-------+
 287            |  Client |     | Monitor |     |  MDS  |     |  OSD  |
 288            +---------+     +---------+     +-------+     +-------+
 289                 |  request to   |              |             |
 290                 | create a user |              |             |
 291                 |-------------->| mon and      |             |
 292                 |<--------------| client share |             |
 293                 |    receive    | a secret.    |             |
 294                 | shared secret |              |             |
 295                 |               |<------------>|             |
 296                 |               |<-------------+------------>|
 297                 |               | mon, mds,    |             |
 298                 | authenticate  | and osd      |             |
 299                 |-------------->| share        |             |
 300                 |<--------------| a secret     |             |
 301                 |  session key  |              |             |
 302                 |               |              |             |
 303                 |  req. ticket  |              |             |
 304                 |-------------->|              |             |
 305                 |<--------------|              |             |
 306                 | recv. ticket  |              |             |
 307                 |               |              |             |
 308                 |   make request (CephFS only) |             |
 309                 |----------------------------->|             |
 310                 |<-----------------------------|             |
 311                 | receive response (CephFS only)             |
 312                 |                                            |
 313                 |                make request                |
 314                 |------------------------------------------->|
 315                 |<-------------------------------------------|
 316                                receive response
 317
 318 The protection offered by this authentication is between the Ceph client and the
 319 Ceph server hosts. The authentication is not extended beyond the Ceph client. If
 320 the user accesses the Ceph client from a remote host, Ceph authentication is not
 321 applied to the connection between the user's host and the client host.
 322
 323
 324 For configuration details, see `Cephx Config Guide`_. For user management
 325 details, see `User Management`_.
 326
 327
 328 .. index:: architecture; smart daemons and scalability
 329
 330 Smart Daemons Enable Hyperscale
 331 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 332
 333 In many clustered architectures, the primary purpose of cluster membership is
 334 so that a centralized interface knows which nodes it can access. Then the
 335 centralized interface provides services to the client through a double
 336 dispatch--which is a **huge** bottleneck at the petabyte-to-exabyte scale.
 337
 338 Ceph eliminates the bottleneck: Ceph's OSD Daemons AND Ceph Clients are cluster
 339 aware. Like Ceph clients, each Ceph OSD Daemon knows about other Ceph OSD
 340 Daemons in the cluster.  This enables Ceph OSD Daemons to interact directly with
 341 other Ceph OSD Daemons and Ceph Monitors. Additionally, it enables Ceph Clients
 342 to interact directly with Ceph OSD Daemons.
 343
 344 The ability of Ceph Clients, Ceph Monitors and Ceph OSD Daemons to interact with
 345 each other means that Ceph OSD Daemons can utilize the CPU and RAM of the Ceph
 346 nodes to easily perform tasks that would bog down a centralized server. The
 347 ability to leverage this computing power leads to several major benefits:
 348
 349 #. **OSDs Service Clients Directly:** Since any network device has a limit to
 350    the number of concurrent connections it can support, a centralized system
 351    has a low physical limit at high scales. By enabling Ceph Clients to contact
 352    Ceph OSD Daemons directly, Ceph increases both performance and total system
 353    capacity simultaneously, while removing a single point of failure. Ceph
 354    Clients can maintain a session when they need to, and with a particular Ceph
 355    OSD Daemon instead of a centralized server.
 356
 357 #. **OSD Membership and Status**: Ceph OSD Daemons join a cluster and report
 358    on their status. At the lowest level, the Ceph OSD Daemon status is ``up``
 359    or ``down`` reflecting whether or not it is running and able to service
 360    Ceph Client requests. If a Ceph OSD Daemon is ``down`` and ``in`` the Ceph
 361    Storage Cluster, this status may indicate the failure of the Ceph OSD
 362    Daemon. If a Ceph OSD Daemon is not running (e.g., it crashes), the Ceph OSD
 363    Daemon cannot notify the Ceph Monitor that it is ``down``. The OSDs
 364    periodically send messages to the Ceph Monitor (``MPGStats`` pre-luminous,
 365    and a new ``MOSDBeacon`` in luminous).  If the Ceph Monitor doesn't see that
 366    message after a configurable period of time then it marks the OSD down.
 367    This mechanism is a failsafe, however. Normally, Ceph OSD Daemons will
 368    determine if a neighboring OSD is down and report it to the Ceph Monitor(s).
 369    This assures that Ceph Monitors are lightweight processes.  See `Monitoring
 370    OSDs`_ and `Heartbeats`_ for additional details.
 371
 372 #. **Data Scrubbing:** As part of maintaining data consistency and cleanliness,
 373    Ceph OSD Daemons can scrub objects within placement groups. That is, Ceph
 374    OSD Daemons can compare object metadata in one placement group with its
 375    replicas in placement groups stored on other OSDs. Scrubbing (usually
 376    performed daily) catches bugs or filesystem errors. Ceph OSD Daemons also
 377    perform deeper scrubbing by comparing data in objects bit-for-bit. Deep
 378    scrubbing (usually performed weekly) finds bad sectors on a drive that
 379    weren't apparent in a light scrub. See `Data Scrubbing`_ for details on
 380    configuring scrubbing.
 381
 382 #. **Replication:** Like Ceph Clients, Ceph OSD Daemons use the CRUSH
 383    algorithm, but the Ceph OSD Daemon uses it to compute where replicas of
 384    objects should be stored (and for rebalancing). In a typical write scenario,
 385    a client uses the CRUSH algorithm to compute where to store an object, maps
 386    the object to a pool and placement group, then looks at the CRUSH map to
 387    identify the primary OSD for the placement group.
 388
 389    The client writes the object to the identified placement group in the
 390    primary OSD. Then, the primary OSD with its own copy of the CRUSH map
 391    identifies the secondary and tertiary OSDs for replication purposes, and
 392    replicates the object to the appropriate placement groups in the secondary
 393    and tertiary OSDs (as many OSDs as additional replicas), and responds to the
 394    client once it has confirmed the object was stored successfully.
 395
 396 .. ditaa::
 397              +----------+
 398              |  Client  |
 399              |          |
 400              +----------+
 401                  *  ^
 402       Write (1)  |  |  Ack (6)
 403                  |  |
 404                  v  *
 405             +-------------+
 406             | Primary OSD |
 407             |             |
 408             +-------------+
 409               *  ^   ^  *
 410     Write (2) |  |   |  |  Write (3)
 411        +------+  |   |  +------+
 412        |  +------+   +------+  |
 413        |  | Ack (4)  Ack (5)|  |
 414        v  *                 *  v
 415  +---------------+   +---------------+
 416  | Secondary OSD |   | Tertiary OSD  |
 417  |               |   |               |
 418  +---------------+   +---------------+
 419
 420 With the ability to perform data replication, Ceph OSD Daemons relieve Ceph
 421 clients from that duty, while ensuring high data availability and data safety.
 422
 423
 424 Dynamic Cluster Management
 425 --------------------------
 426
 427 In the `Scalability and High Availability`_ section, we explained how Ceph uses
 428 CRUSH, cluster awareness and intelligent daemons to scale and maintain high
 429 availability. Key to Ceph's design is the autonomous, self-healing, and
 430 intelligent Ceph OSD Daemon. Let's take a deeper look at how CRUSH works to
 431 enable modern cloud storage infrastructures to place data, rebalance the cluster
 432 and recover from faults dynamically.
 433
 434 .. index:: architecture; pools
 435
 436 About Pools
 437 ~~~~~~~~~~~
 438
 439 The Ceph storage system supports the notion of 'Pools', which are logical
 440 partitions for storing objects.
 441
 442 Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write objects to
 443 pools. The pool's ``size`` or number of replicas, the CRUSH ruleset and the
 444 number of placement groups determine how Ceph will place the data.
 445
 446 .. ditaa::
 447             +--------+  Retrieves  +---------------+
 448             | Client |------------>|  Cluster Map  |
 449             +--------+             +---------------+
 450                  |
 451                  v      Writes
 452               /-----\
 453               | obj |
 454               \-----/
 455                  |      To
 456                  v
 457             +--------+           +---------------+
 458             |  Pool  |---------->| CRUSH Ruleset |
 459             +--------+  Selects  +---------------+
 460
 461
 462 Pools set at least the following parameters:
 463
 464 - Ownership/Access to Objects
 465 - The Number of Placement Groups, and
 466 - The CRUSH Ruleset to Use.
 467
 468 See `Set Pool Values`_ for details.
 469
 470
 471 .. index: architecture; placement group mapping
 472
 473 Mapping PGs to OSDs
 474 ~~~~~~~~~~~~~~~~~~~
 475
 476 Each pool has a number of placement groups. CRUSH maps PGs to OSDs dynamically.
 477 When a Ceph Client stores objects, CRUSH will map each object to a placement
 478 group.
 479
 480 Mapping objects to placement groups creates a layer of indirection between the
 481 Ceph OSD Daemon and the Ceph Client. The Ceph Storage Cluster must be able to
 482 grow (or shrink) and rebalance where it stores objects dynamically. If the Ceph
 483 Client "knew" which Ceph OSD Daemon had which object, that would create a tight
 484 coupling between the Ceph Client and the Ceph OSD Daemon. Instead, the CRUSH
 485 algorithm maps each object to a placement group and then maps each placement
 486 group to one or more Ceph OSD Daemons. This layer of indirection allows Ceph to
 487 rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices
 488 come online. The following diagram depicts how CRUSH maps objects to placement
 489 groups, and placement groups to OSDs.
 490
 491 .. ditaa::
 492            /-----\  /-----\  /-----\  /-----\  /-----\
 493            | obj |  | obj |  | obj |  | obj |  | obj |
 494            \-----/  \-----/  \-----/  \-----/  \-----/
 495               |        |        |        |        |
 496               +--------+--------+        +---+----+
 497               |                              |
 498               v                              v
 499    +-----------------------+      +-----------------------+
 500    |  Placement Group #1   |      |  Placement Group #2   |
 501    |                       |      |                       |
 502    +-----------------------+      +-----------------------+
 503                |                              |
 504                |      +-----------------------+---+
 505         +------+------+-------------+             |
 506         |             |             |             |
 507         v             v             v             v
 508    /----------\  /----------\  /----------\  /----------\
 509    |          |  |          |  |          |  |          |
 510    |  OSD #1  |  |  OSD #2  |  |  OSD #3  |  |  OSD #4  |
 511    |          |  |          |  |          |  |          |
 512    \----------/  \----------/  \----------/  \----------/
 513
 514 With a copy of the cluster map and the CRUSH algorithm, the client can compute
 515 exactly which OSD to use when reading or writing a particular object.
 516
 517 .. index:: architecture; calculating PG IDs
 518
 519 Calculating PG IDs
 520 ~~~~~~~~~~~~~~~~~~
 521
 522 When a Ceph Client binds to a Ceph Monitor, it retrieves the latest copy of the
 523 `Cluster Map`_. With the cluster map, the client knows about all of the monitors,
 524 OSDs, and metadata servers in the cluster. **However, it doesn't know anything
 525 about object locations.**
 526
 527 .. epigraph::
 528
 529         Object locations get computed.
 530
 531
 532 The only input required by the client is the object ID and the pool.
 533 It's simple: Ceph stores data in named pools (e.g., "liverpool"). When a client
 534 wants to store a named object (e.g., "john," "paul," "george," "ringo", etc.)
 535 it calculates a placement group using the object name, a hash code, the
 536 number of PGs in the pool and the pool name. Ceph clients use the following
 537 steps to compute PG IDs.
 538
 539 #. The client inputs the pool ID and the object ID. (e.g., pool = "liverpool"
 540    and object-id = "john")
 541 #. Ceph takes the object ID and hashes it.
 542 #. Ceph calculates the hash modulo the number of PGs. (e.g., ``58``) to get
 543    a PG ID.
 544 #. Ceph gets the pool ID given the pool name (e.g., "liverpool" = ``4``)
 545 #. Ceph prepends the pool ID to the PG ID (e.g., ``4.58``).
 546
 547 Computing object locations is much faster than performing object location query
 548 over a chatty session. The :abbr:`CRUSH (Controlled Replication Under Scalable
 549 Hashing)` algorithm allows a client to compute where objects *should* be stored,
 550 and enables the client to contact the primary OSD to store or retrieve the
 551 objects.
 552
 553 .. index:: architecture; PG Peering
 554
 555 Peering and Sets
 556 ~~~~~~~~~~~~~~~~
 557
 558 In previous sections, we noted that Ceph OSD Daemons check each others
 559 heartbeats and report back to the Ceph Monitor. Another thing Ceph OSD daemons
 560 do is called 'peering', which is the process of bringing all of the OSDs that
 561 store a Placement Group (PG) into agreement about the state of all of the
 562 objects (and their metadata) in that PG. In fact, Ceph OSD Daemons `Report
 563 Peering Failure`_ to the Ceph Monitors. Peering issues  usually resolve
 564 themselves; however, if the problem persists, you may need to refer to the
 565 `Troubleshooting Peering Failure`_ section.
 566
 567 .. Note:: Agreeing on the state does not mean that the PGs have the latest contents.
 568
 569 The Ceph Storage Cluster was designed to store at least two copies of an object
 570 (i.e., ``size = 2``), which is the minimum requirement for data safety. For high
 571 availability, a Ceph Storage Cluster should store more than two copies of an object
 572 (e.g., ``size = 3`` and ``min size = 2``) so that it can continue to run in a
 573 ``degraded`` state while maintaining data safety.
 574
 575 Referring back to the diagram in `Smart Daemons Enable Hyperscale`_, we do not
 576 name the Ceph OSD Daemons specifically (e.g., ``osd.0``, ``osd.1``, etc.), but
 577 rather refer to them as *Primary*, *Secondary*, and so forth. By convention,
 578 the *Primary* is the first OSD in the *Acting Set*, and is responsible for
 579 coordinating the peering process for each placement group where it acts as
 580 the *Primary*, and is the **ONLY** OSD that that will accept client-initiated
 581 writes to objects for a given placement group where it acts as the *Primary*.
 582
 583 When a series of OSDs are responsible for a placement group, that series of
 584 OSDs, we refer to them as an *Acting Set*. An *Acting Set* may refer to the Ceph
 585 OSD Daemons that are currently responsible for the placement group, or the Ceph
 586 OSD Daemons that were responsible  for a particular placement group as of some
 587 epoch.
 588
 589 The Ceph OSD daemons that are part of an *Acting Set* may not always be  ``up``.
 590 When an OSD in the *Acting Set* is ``up``, it is part of the  *Up Set*. The *Up
 591 Set* is an important distinction, because Ceph can remap PGs to other Ceph OSD
 592 Daemons when an OSD fails.
 593
 594 .. note:: In an *Acting Set* for a PG containing ``osd.25``, ``osd.32`` and
 595    ``osd.61``, the first OSD, ``osd.25``, is the *Primary*. If that OSD fails,
 596    the Secondary, ``osd.32``, becomes the *Primary*, and ``osd.25`` will be
 597    removed from the *Up Set*.
 598
 599
 600 .. index:: architecture; Rebalancing
 601
 602 Rebalancing
 603 ~~~~~~~~~~~
 604
 605 When you add a Ceph OSD Daemon to a Ceph Storage Cluster, the cluster map gets
 606 updated with the new OSD. Referring back to `Calculating PG IDs`_, this changes
 607 the cluster map. Consequently, it changes object placement, because it changes
 608 an input for the calculations. The following diagram depicts the rebalancing
 609 process (albeit rather crudely, since it is substantially less impactful with
 610 large clusters) where some, but not all of the PGs migrate from existing OSDs
 611 (OSD 1, and OSD 2) to the new OSD (OSD 3). Even when rebalancing, CRUSH is
 612 stable. Many of the placement groups remain in their original configuration,
 613 and each OSD gets some added capacity, so there are no load spikes on the
 614 new OSD after rebalancing is complete.
 615
 616
 617 .. ditaa::
 618            +--------+     +--------+
 619    Before  |  OSD 1 |     |  OSD 2 |
 620            +--------+     +--------+
 621            |  PG #1 |     | PG #6  |
 622            |  PG #2 |     | PG #7  |
 623            |  PG #3 |     | PG #8  |
 624            |  PG #4 |     | PG #9  |
 625            |  PG #5 |     | PG #10 |
 626            +--------+     +--------+
 627
 628            +--------+     +--------+     +--------+
 629     After  |  OSD 1 |     |  OSD 2 |     |  OSD 3 |
 630            +--------+     +--------+     +--------+
 631            |  PG #1 |     | PG #7  |     |  PG #3 |
 632            |  PG #2 |     | PG #8  |     |  PG #6 |
 633            |  PG #4 |     | PG #10 |     |  PG #9 |
 634            |  PG #5 |     |        |     |        |
 635            |        |     |        |     |        |
 636            +--------+     +--------+     +--------+
 637
 638
 639 .. index:: architecture; Data Scrubbing
 640
 641 Data Consistency
 642 ~~~~~~~~~~~~~~~~
 643
 644 As part of maintaining data consistency and cleanliness, Ceph OSDs can also
 645 scrub objects within placement groups. That is, Ceph OSDs can compare object
 646 metadata in one placement group with its replicas in placement groups stored in
 647 other OSDs. Scrubbing (usually performed daily) catches OSD bugs or filesystem
 648 errors.  OSDs can also perform deeper scrubbing by comparing data in objects
 649 bit-for-bit.  Deep scrubbing (usually performed weekly) finds bad sectors on a
 650 disk that weren't apparent in a light scrub.
 651
 652 See `Data Scrubbing`_ for details on configuring scrubbing.
 653
 654
 655
 656
 657
 658 .. index:: erasure coding
 659
 660 Erasure Coding
 661 --------------
 662
 663 An erasure coded pool stores each object as ``K+M`` chunks. It is divided into
 664 ``K`` data chunks and ``M`` coding chunks. The pool is configured to have a size
 665 of ``K+M`` so that each chunk is stored in an OSD in the acting set. The rank of
 666 the chunk is stored as an attribute of the object.
 667
 668 For instance an erasure coded pool is created to use five OSDs (``K+M = 5``) and
 669 sustain the loss of two of them (``M = 2``).
 670
 671 Reading and Writing Encoded Chunks
 672 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 673
 674 When the object **NYAN** containing ``ABCDEFGHI`` is written to the pool, the erasure
 675 encoding function splits the content into three data chunks simply by dividing
 676 the content in three: the first contains ``ABC``, the second ``DEF`` and the
 677 last ``GHI``. The content will be padded if the content length is not a multiple
 678 of ``K``. The function also creates two coding chunks: the fourth with ``YXY``
 679 and the fifth with ``GQC``. Each chunk is stored in an OSD in the acting set.
 680 The chunks are stored in objects that have the same name (**NYAN**) but reside
 681 on different OSDs. The order in which the chunks were created must be preserved
 682 and is stored as an attribute of the object (``shard_t``), in addition to its
 683 name. Chunk 1 contains ``ABC`` and is stored on **OSD5** while chunk 4 contains
 684 ``YXY`` and is stored on **OSD3**.
 685
 686
 687 .. ditaa::
 688                             +-------------------+
 689                        name |       NYAN        |
 690                             +-------------------+
 691                     content |     ABCDEFGHI     |
 692                             +--------+----------+
 693                                      |
 694                                      |
 695                                      v
 696                               +------+------+
 697               +---------------+ encode(3,2) +-----------+
 698               |               +--+--+---+---+           |
 699               |                  |  |   |               |
 700               |          +-------+  |   +-----+         |
 701               |          |          |         |         |
 702            +--v---+   +--v---+   +--v---+  +--v---+  +--v---+
 703      name  | NYAN |   | NYAN |   | NYAN |  | NYAN |  | NYAN |
 704            +------+   +------+   +------+  +------+  +------+
 705     shard  |  1   |   |  2   |   |  3   |  |  4   |  |  5   |
 706            +------+   +------+   +------+  +------+  +------+
 707   content  | ABC  |   | DEF  |   | GHI  |  | YXY  |  | QGC  |
 708            +--+---+   +--+---+   +--+---+  +--+---+  +--+---+
 709               |          |          |         |         |
 710               |          |          v         |         |
 711               |          |       +--+---+     |         |
 712               |          |       | OSD1 |     |         |
 713               |          |       +------+     |         |
 714               |          |                    |         |
 715               |          |       +------+     |         |
 716               |          +------>| OSD2 |     |         |
 717               |                  +------+     |         |
 718               |                               |         |
 719               |                  +------+     |         |
 720               |                  | OSD3 |<----+         |
 721               |                  +------+               |
 722               |                                         |
 723               |                  +------+               |
 724               |                  | OSD4 |<--------------+
 725               |                  +------+
 726               |
 727               |                  +------+
 728               +----------------->| OSD5 |
 729                                  +------+
 730
 731
 732 When the object **NYAN** is read from the erasure coded pool, the decoding
 733 function reads three chunks: chunk 1 containing ``ABC``, chunk 3 containing
 734 ``GHI`` and chunk 4 containing ``YXY``. Then, it rebuilds the original content
 735 of the object ``ABCDEFGHI``. The decoding function is informed that the chunks 2
 736 and 5 are missing (they are called 'erasures'). The chunk 5 could not be read
 737 because the **OSD4** is out. The decoding function can be called as soon as
 738 three chunks are read: **OSD2** was the slowest and its chunk was not taken into
 739 account.
 740
 741 .. ditaa::
 742                                  +-------------------+
 743                             name |       NYAN        |
 744                                  +-------------------+
 745                          content |     ABCDEFGHI     |
 746                                  +---------+---------+
 747                                            ^
 748                                            |
 749                                            |
 750                                    +-------+-------+
 751                                    |  decode(3,2)  |
 752                     +------------->+  erasures 2,5 +<-+
 753                     |              |               |  |
 754                     |              +-------+-------+  |
 755                     |                      ^          |
 756                     |                      |          |
 757                     |                      |          |
 758                  +--+---+   +------+   +---+--+   +---+--+
 759            name  | NYAN |   | NYAN |   | NYAN |   | NYAN |
 760                  +------+   +------+   +------+   +------+
 761           shard  |  1   |   |  2   |   |  3   |   |  4   |
 762                  +------+   +------+   +------+   +------+
 763         content  | ABC  |   | DEF  |   | GHI  |   | YXY  |
 764                  +--+---+   +--+---+   +--+---+   +--+---+
 765                     ^          .          ^          ^
 766                     |    TOO   .          |          |
 767                     |    SLOW  .       +--+---+      |
 768                     |          ^       | OSD1 |      |
 769                     |          |       +------+      |
 770                     |          |                     |
 771                     |          |       +------+      |
 772                     |          +-------| OSD2 |      |
 773                     |                  +------+      |
 774                     |                                |
 775                     |                  +------+      |
 776                     |                  | OSD3 |------+
 777                     |                  +------+
 778                     |
 779                     |                  +------+
 780                     |                  | OSD4 | OUT
 781                     |                  +------+
 782                     |
 783                     |                  +------+
 784                     +------------------| OSD5 |
 785                                        +------+
 786
 787
 788 Interrupted Full Writes
 789 ~~~~~~~~~~~~~~~~~~~~~~~
 790
 791 In an erasure coded pool, the primary OSD in the up set receives all write
 792 operations. It is responsible for encoding the payload into ``K+M`` chunks and
 793 sends them to the other OSDs. It is also responsible for maintaining an
 794 authoritative version of the placement group logs.
 795
 796 In the following diagram, an erasure coded placement group has been created with
 797 ``K = 2 + M = 1`` and is supported by three OSDs, two for ``K`` and one for
 798 ``M``. The acting set of the placement group is made of **OSD 1**, **OSD 2** and
 799 **OSD 3**. An object has been encoded and stored in the OSDs : the chunk
 800 ``D1v1`` (i.e. Data chunk number 1, version 1) is on **OSD 1**, ``D2v1`` on
 801 **OSD 2** and ``C1v1`` (i.e. Coding chunk number 1, version 1) on **OSD 3**. The
 802 placement group logs on each OSD are identical (i.e. ``1,1`` for epoch 1,
 803 version 1).
 804
 805
 806 .. ditaa::
 807      Primary OSD
 808
 809    +-------------+
 810    |    OSD 1    |             +-------------+
 811    |         log |  Write Full |             |
 812    |  +----+     |<------------+ Ceph Client |
 813    |  |D1v1| 1,1 |      v1     |             |
 814    |  +----+     |             +-------------+
 815    +------+------+
 816           |
 817           |
 818           |          +-------------+
 819           |          |    OSD 2    |
 820           |          |         log |
 821           +--------->+  +----+     |
 822           |          |  |D2v1| 1,1 |
 823           |          |  +----+     |
 824           |          +-------------+
 825           |
 826           |          +-------------+
 827           |          |    OSD 3    |
 828           |          |         log |
 829           +--------->|  +----+     |
 830                      |  |C1v1| 1,1 |
 831                      |  +----+     |
 832                      +-------------+
 833
 834 **OSD 1** is the primary and receives a **WRITE FULL** from a client, which
 835 means the payload is to replace the object entirely instead of overwriting a
 836 portion of it. Version 2 (v2) of the object is created to override version 1
 837 (v1). **OSD 1** encodes the payload into three chunks: ``D1v2`` (i.e. Data
 838 chunk number 1 version 2) will be on **OSD 1**, ``D2v2`` on **OSD 2** and
 839 ``C1v2`` (i.e. Coding chunk number 1 version 2) on **OSD 3**. Each chunk is sent
 840 to the target OSD, including the primary OSD which is responsible for storing
 841 chunks in addition to handling write operations and maintaining an authoritative
 842 version of the placement group logs. When an OSD receives the message
 843 instructing it to write the chunk, it also creates a new entry in the placement
 844 group logs to reflect the change. For instance, as soon as **OSD 3** stores
 845 ``C1v2``, it adds the entry ``1,2`` ( i.e. epoch 1, version 2 ) to its logs.
 846 Because the OSDs work asynchronously, some chunks may still be in flight ( such
 847 as ``D2v2`` ) while others are acknowledged and on disk ( such as ``C1v1`` and
 848 ``D1v1``).
 849
 850 .. ditaa::
 851
 852      Primary OSD
 853
 854    +-------------+
 855    |    OSD 1    |
 856    |         log |
 857    |  +----+     |             +-------------+
 858    |  |D1v2| 1,2 |  Write Full |             |
 859    |  +----+     +<------------+ Ceph Client |
 860    |             |      v2     |             |
 861    |  +----+     |             +-------------+
 862    |  |D1v1| 1,1 |
 863    |  +----+     |
 864    +------+------+
 865           |
 866           |
 867           |           +------+------+
 868           |           |    OSD 2    |
 869           |  +------+ |         log |
 870           +->| D2v2 | |  +----+     |
 871           |  +------+ |  |D2v1| 1,1 |
 872           |           |  +----+     |
 873           |           +-------------+
 874           |
 875           |           +-------------+
 876           |           |    OSD 3    |
 877           |           |         log |
 878           |           |  +----+     |
 879           |           |  |C1v2| 1,2 |
 880           +---------->+  +----+     |
 881                       |             |
 882                       |  +----+     |
 883                       |  |C1v1| 1,1 |
 884                       |  +----+     |
 885                       +-------------+
 886
 887
 888 If all goes well, the chunks are acknowledged on each OSD in the acting set and
 889 the logs' ``last_complete`` pointer can move from ``1,1`` to ``1,2``.
 890
 891 .. ditaa::
 892
 893      Primary OSD
 894
 895    +-------------+
 896    |    OSD 1    |
 897    |         log |
 898    |  +----+     |             +-------------+
 899    |  |D1v2| 1,2 |  Write Full |             |
 900    |  +----+     +<------------+ Ceph Client |
 901    |             |      v2     |             |
 902    |  +----+     |             +-------------+
 903    |  |D1v1| 1,1 |
 904    |  +----+     |
 905    +------+------+
 906           |
 907           |           +-------------+
 908           |           |    OSD 2    |
 909           |           |         log |
 910           |           |  +----+     |
 911           |           |  |D2v2| 1,2 |
 912           +---------->+  +----+     |
 913           |           |             |
 914           |           |  +----+     |
 915           |           |  |D2v1| 1,1 |
 916           |           |  +----+     |
 917           |           +-------------+
 918           |
 919           |           +-------------+
 920           |           |    OSD 3    |
 921           |           |         log |
 922           |           |  +----+     |
 923           |           |  |C1v2| 1,2 |
 924           +---------->+  +----+     |
 925                       |             |
 926                       |  +----+     |
 927                       |  |C1v1| 1,1 |
 928                       |  +----+     |
 929                       +-------------+
 930
 931
 932 Finally, the files used to store the chunks of the previous version of the
 933 object can be removed: ``D1v1`` on **OSD 1**, ``D2v1`` on **OSD 2** and ``C1v1``
 934 on **OSD 3**.
 935
 936 .. ditaa::
 937      Primary OSD
 938
 939    +-------------+
 940    |    OSD 1    |
 941    |         log |
 942    |  +----+     |
 943    |  |D1v2| 1,2 |
 944    |  +----+     |
 945    +------+------+
 946           |
 947           |
 948           |          +-------------+
 949           |          |    OSD 2    |
 950           |          |         log |
 951           +--------->+  +----+     |
 952           |          |  |D2v2| 1,2 |
 953           |          |  +----+     |
 954           |          +-------------+
 955           |
 956           |          +-------------+
 957           |          |    OSD 3    |
 958           |          |         log |
 959           +--------->|  +----+     |
 960                      |  |C1v2| 1,2 |
 961                      |  +----+     |
 962                      +-------------+
 963
 964
 965 But accidents happen. If **OSD 1** goes down while ``D2v2`` is still in flight,
 966 the object's version 2 is partially written: **OSD 3** has one chunk but that is
 967 not enough to recover. It lost two chunks: ``D1v2`` and ``D2v2`` and the
 968 erasure coding parameters ``K = 2``, ``M = 1`` require that at least two chunks are
 969 available to rebuild the third. **OSD 4** becomes the new primary and finds that
 970 the ``last_complete`` log entry (i.e., all objects before this entry were known
 971 to be available on all OSDs in the previous acting set ) is ``1,1`` and that
 972 will be the head of the new authoritative log.
 973
 974 .. ditaa::
 975    +-------------+
 976    |    OSD 1    |
 977    |   (down)    |
 978    | c333        |
 979    +------+------+
 980           |
 981           |           +-------------+
 982           |           |    OSD 2    |
 983           |           |         log |
 984           |           |  +----+     |
 985           +---------->+  |D2v1| 1,1 |
 986           |           |  +----+     |
 987           |           |             |
 988           |           +-------------+
 989           |
 990           |           +-------------+
 991           |           |    OSD 3    |
 992           |           |         log |
 993           |           |  +----+     |
 994           |           |  |C1v2| 1,2 |
 995           +---------->+  +----+     |
 996                       |             |
 997                       |  +----+     |
 998                       |  |C1v1| 1,1 |
 999                       |  +----+     |
1000                       +-------------+
1001      Primary OSD
1002    +-------------+
1003    |    OSD 4    |
1004    |         log |
1005    |             |
1006    |         1,1 |
1007    |             |
1008    +------+------+
1009
1010
1011
1012 The log entry 1,2 found on **OSD 3** is divergent from the new authoritative log
1013 provided by **OSD 4**: it is discarded and the file containing the ``C1v2``
1014 chunk is removed. The ``D1v1`` chunk is rebuilt with the ``decode`` function of
1015 the erasure coding library during scrubbing and stored on the new primary
1016 **OSD 4**.
1017
1018
1019 .. ditaa::
1020      Primary OSD
1021
1022    +-------------+
1023    |    OSD 4    |
1024    |         log |
1025    |  +----+     |
1026    |  |D1v1| 1,1 |
1027    |  +----+     |
1028    +------+------+
1029           ^
1030           |
1031           |          +-------------+
1032           |          |    OSD 2    |
1033           |          |         log |
1034           +----------+  +----+     |
1035           |          |  |D2v1| 1,1 |
1036           |          |  +----+     |
1037           |          +-------------+
1038           |
1039           |          +-------------+
1040           |          |    OSD 3    |
1041           |          |         log |
1042           +----------|  +----+     |
1043                      |  |C1v1| 1,1 |
1044                      |  +----+     |
1045                      +-------------+
1046
1047    +-------------+
1048    |    OSD 1    |
1049    |   (down)    |
1050    | c333        |
1051    +-------------+
1052
1053 See `Erasure Code Notes`_ for additional details.
1054
1055
1056
1057 Cache Tiering
1058 -------------
1059
1060 A cache tier provides Ceph Clients with better I/O performance for a subset of
1061 the data stored in a backing storage tier. Cache tiering involves creating a
1062 pool of relatively fast/expensive storage devices (e.g., solid state drives)
1063 configured to act as a cache tier, and a backing pool of either erasure-coded
1064 or relatively slower/cheaper devices configured to act as an economical storage
1065 tier. The Ceph objecter handles where to place the objects and the tiering
1066 agent determines when to flush objects from the cache to the backing storage
1067 tier. So the cache tier and the backing storage tier are completely transparent
1068 to Ceph clients.
1069
1070
1071 .. ditaa::
1072            +-------------+
1073            | Ceph Client |
1074            +------+------+
1075                   ^
1076      Tiering is   |
1077     Transparent   |              Faster I/O
1078         to Ceph   |           +---------------+
1079      Client Ops   |           |               |
1080                   |    +----->+   Cache Tier  |
1081                   |    |      |               |
1082                   |    |      +-----+---+-----+
1083                   |    |            |   ^
1084                   v    v            |   |   Active Data in Cache Tier
1085            +------+----+--+         |   |
1086            |   Objecter   |         |   |
1087            +-----------+--+         |   |
1088                        ^            |   |   Inactive Data in Storage Tier
1089                        |            v   |
1090                        |      +-----+---+-----+
1091                        |      |               |
1092                        +----->|  Storage Tier |
1093                               |               |
1094                               +---------------+
1095                                  Slower I/O
1096
1097 See `Cache Tiering`_ for additional details.
1098
1099
1100 .. index:: Extensibility, Ceph Classes
1101
1102 Extending Ceph
1103 --------------
1104
1105 You can extend Ceph by creating shared object classes called 'Ceph Classes'.
1106 Ceph loads ``.so`` classes stored in the ``osd class dir`` directory dynamically
1107 (i.e., ``$libdir/rados-classes`` by default). When you implement a class, you
1108 can create new object methods that have the ability to call the native methods
1109 in the Ceph Object Store, or other class methods you incorporate via libraries
1110 or create yourself.
1111
1112 On writes, Ceph Classes can call native or class methods, perform any series of
1113 operations on the inbound data and generate a resulting write transaction  that
1114 Ceph will apply atomically.
1115
1116 On reads, Ceph Classes can call native or class methods, perform any series of
1117 operations on the outbound data and return the data to the client.
1118
1119 .. topic:: Ceph Class Example
1120
1121    A Ceph class for a content management system that presents pictures of a
1122    particular size and aspect ratio could take an inbound bitmap image, crop it
1123    to a particular aspect ratio, resize it and embed an invisible copyright or
1124    watermark to help protect the intellectual property; then, save the
1125    resulting bitmap image to the object store.
1126
1127 See ``src/objclass/objclass.h``, ``src/fooclass.cc`` and ``src/barclass`` for
1128 exemplary implementations.
1129
1130
1131 Summary
1132 -------
1133
1134 Ceph Storage Clusters are dynamic--like a living organism. Whereas, many storage
1135 appliances do not fully utilize the CPU and RAM of a typical commodity server,
1136 Ceph does. From heartbeats, to  peering, to rebalancing the cluster or
1137 recovering from faults,  Ceph offloads work from clients (and from a centralized
1138 gateway which doesn't exist in the Ceph architecture) and uses the computing
1139 power of the OSDs to perform the work. When referring to `Hardware
1140 Recommendations`_ and the `Network Config Reference`_,  be cognizant of the
1141 foregoing concepts to understand how Ceph utilizes computing resources.
1142
1143 .. index:: Ceph Protocol, librados
1144
1145 Ceph Protocol
1146 =============
1147
1148 Ceph Clients use the native protocol for interacting with the Ceph Storage
1149 Cluster. Ceph packages this functionality into the ``librados`` library so that
1150 you can create your own custom Ceph Clients. The following diagram depicts the
1151 basic architecture.
1152
1153 .. ditaa::
1154             +---------------------------------+
1155             |  Ceph Storage Cluster Protocol  |
1156             |           (librados)            |
1157             +---------------------------------+
1158             +---------------+ +---------------+
1159             |      OSDs     | |    Monitors   |
1160             +---------------+ +---------------+
1161
1162
1163 Native Protocol and ``librados``
1164 --------------------------------
1165
1166 Modern applications need a simple object storage interface with asynchronous
1167 communication capability. The Ceph Storage Cluster provides a simple object
1168 storage interface with asynchronous communication capability. The interface
1169 provides direct, parallel access to objects throughout the cluster.
1170
1171
1172 - Pool Operations
1173 - Snapshots and Copy-on-write Cloning
1174 - Read/Write Objects
1175   - Create or Remove
1176   - Entire Object or Byte Range
1177   - Append or Truncate
1178 - Create/Set/Get/Remove XATTRs
1179 - Create/Set/Get/Remove Key/Value Pairs
1180 - Compound operations and dual-ack semantics
1181 - Object Classes
1182
1183
1184 .. index:: architecture; watch/notify
1185
1186 Object Watch/Notify
1187 -------------------
1188
1189 A client can register a persistent interest with an object and keep a session to
1190 the primary OSD open. The client can send a notification message and a payload to
1191 all watchers and receive notification when the watchers receive the
1192 notification. This enables a client to use any object as a
1193 synchronization/communication channel.
1194
1195
1196 .. ditaa:: +----------+     +----------+     +----------+     +---------------+
1197            | Client 1 |     | Client 2 |     | Client 3 |     | OSD:Object ID |
1198            +----------+     +----------+     +----------+     +---------------+
1199                  |                |                |                  |
1200                  |                |                |                  |
1201                  |                |  Watch Object  |                  |
1202                  |--------------------------------------------------->|
1203                  |                |                |                  |
1204                  |<---------------------------------------------------|
1205                  |                |   Ack/Commit   |                  |
1206                  |                |                |                  |
1207                  |                |  Watch Object  |                  |
1208                  |                |---------------------------------->|
1209                  |                |                |                  |
1210                  |                |<----------------------------------|
1211                  |                |   Ack/Commit   |                  |
1212                  |                |                |   Watch Object   |
1213                  |                |                |----------------->|
1214                  |                |                |                  |
1215                  |                |                |<-----------------|
1216                  |                |                |    Ack/Commit    |
1217                  |                |     Notify     |                  |
1218                  |--------------------------------------------------->|
1219                  |                |                |                  |
1220                  |<---------------------------------------------------|
1221                  |                |     Notify     |                  |
1222                  |                |                |                  |
1223                  |                |<----------------------------------|
1224                  |                |     Notify     |                  |
1225                  |                |                |<-----------------|
1226                  |                |                |      Notify      |
1227                  |                |       Ack      |                  |
1228                  |----------------+---------------------------------->|
1229                  |                |                |                  |
1230                  |                |       Ack      |                  |
1231                  |                +---------------------------------->|
1232                  |                |                |                  |
1233                  |                |                |        Ack       |
1234                  |                |                |----------------->|
1235                  |                |                |                  |
1236                  |<---------------+----------------+------------------|
1237                  |                     Complete
1238
1239 .. index:: architecture; Striping
1240
1241 Data Striping
1242 -------------
1243
1244 Storage devices have throughput limitations, which impact performance and
1245 scalability. So storage systems often support `striping`_--storing sequential
1246 pieces of information across multiple storage devices--to increase throughput
1247 and performance. The most common form of data striping comes from `RAID`_.
1248 The RAID type most similar to Ceph's striping is `RAID 0`_, or a 'striped
1249 volume'. Ceph's striping offers the throughput of RAID 0 striping, the
1250 reliability of n-way RAID mirroring and faster recovery.
1251
1252 Ceph provides three types of clients: Ceph Block Device, Ceph Filesystem, and
1253 Ceph Object Storage. A Ceph Client converts its data from the representation
1254 format it provides to its users (a block device image, RESTful objects, CephFS
1255 filesystem directories) into objects for storage in the Ceph Storage Cluster.
1256
1257 .. tip:: The objects Ceph stores in the Ceph Storage Cluster are not striped.
1258    Ceph Object Storage, Ceph Block Device, and the Ceph Filesystem stripe their
1259    data over multiple Ceph Storage Cluster objects. Ceph Clients that write
1260    directly to the Ceph Storage Cluster via ``librados`` must perform the
1261    striping (and parallel I/O) for themselves to obtain these benefits.
1262
1263 The simplest Ceph striping format involves a stripe count of 1 object. Ceph
1264 Clients write stripe units to a Ceph Storage Cluster object until the object is
1265 at its maximum capacity, and then create another object for additional stripes
1266 of data. The simplest form of striping may be sufficient for small block device
1267 images, S3 or Swift objects and CephFS files. However, this simple form doesn't
1268 take maximum advantage of Ceph's ability to distribute data across placement
1269 groups, and consequently doesn't improve performance very much. The following
1270 diagram depicts the simplest form of striping:
1271
1272 .. ditaa::
1273                         +---------------+
1274                         |  Client Data  |
1275                         |     Format    |
1276                         | cCCC          |
1277                         +---------------+
1278                                 |
1279                        +--------+-------+
1280                        |                |
1281                        v                v
1282                  /-----------\    /-----------\
1283                  | Begin cCCC|    | Begin cCCC|
1284                  | Object  0 |    | Object  1 |
1285                  +-----------+    +-----------+
1286                  |  stripe   |    |  stripe   |
1287                  |  unit 1   |    |  unit 5   |
1288                  +-----------+    +-----------+
1289                  |  stripe   |    |  stripe   |
1290                  |  unit 2   |    |  unit 6   |
1291                  +-----------+    +-----------+
1292                  |  stripe   |    |  stripe   |
1293                  |  unit 3   |    |  unit 7   |
1294                  +-----------+    +-----------+
1295                  |  stripe   |    |  stripe   |
1296                  |  unit 4   |    |  unit 8   |
1297                  +-----------+    +-----------+
1298                  | End cCCC  |    | End cCCC  |
1299                  | Object 0  |    | Object 1  |
1300                  \-----------/    \-----------/
1301
1302
1303 If you anticipate large images sizes, large S3 or Swift objects (e.g., video),
1304 or large CephFS directories, you may see considerable read/write performance
1305 improvements by striping client data over multiple objects within an object set.
1306 Significant write performance occurs when the client writes the stripe units to
1307 their corresponding objects in parallel. Since objects get mapped to different
1308 placement groups and further mapped to different OSDs, each write occurs in
1309 parallel at the maximum write speed. A write to a single disk would be limited
1310 by the head movement (e.g. 6ms per seek) and bandwidth of that one device (e.g.
1311 100MB/s).  By spreading that write over multiple objects (which map to different
1312 placement groups and OSDs) Ceph can reduce the number of seeks per drive and
1313 combine the throughput of multiple drives to achieve much faster write (or read)
1314 speeds.
1315
1316 .. note:: Striping is independent of object replicas. Since CRUSH
1317    replicates objects across OSDs, stripes get replicated automatically.
1318
1319 In the following diagram, client data gets striped across an object set
1320 (``object set 1`` in the following diagram) consisting of 4 objects, where the
1321 first stripe unit is ``stripe unit 0`` in ``object 0``, and the fourth stripe
1322 unit is ``stripe unit 3`` in ``object 3``. After writing the fourth stripe, the
1323 client determines if the object set is full. If the object set is not full, the
1324 client begins writing a stripe to the first object again (``object 0`` in the
1325 following diagram). If the object set is full, the client creates a new object
1326 set (``object set 2`` in the following diagram), and begins writing to the first
1327 stripe (``stripe unit 16``) in the first object in the new object set (``object
1328 4`` in the diagram below).
1329
1330 .. ditaa::
1331                           +---------------+
1332                           |  Client Data  |
1333                           |     Format    |
1334                           | cCCC          |
1335                           +---------------+
1336                                   |
1337        +-----------------+--------+--------+-----------------+
1338        |                 |                 |                 |     +--\
1339        v                 v                 v                 v        |
1340  /-----------\     /-----------\     /-----------\     /-----------\  |
1341  | Begin cCCC|     | Begin cCCC|     | Begin cCCC|     | Begin cCCC|  |
1342  | Object 0  |     | Object  1 |     | Object  2 |     | Object  3 |  |
1343  +-----------+     +-----------+     +-----------+     +-----------+  |
1344  |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |  |
1345  |  unit 0   |     |  unit 1   |     |  unit 2   |     |  unit 3   |  |
1346  +-----------+     +-----------+     +-----------+     +-----------+  |
1347  |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |  +-\
1348  |  unit 4   |     |  unit 5   |     |  unit 6   |     |  unit 7   |    | Object
1349  +-----------+     +-----------+     +-----------+     +-----------+    +- Set
1350  |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |    |   1
1351  |  unit 8   |     |  unit 9   |     |  unit 10  |     |  unit 11  |  +-/
1352  +-----------+     +-----------+     +-----------+     +-----------+  |
1353  |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |  |
1354  |  unit 12  |     |  unit 13  |     |  unit 14  |     |  unit 15  |  |
1355  +-----------+     +-----------+     +-----------+     +-----------+  |
1356  | End cCCC  |     | End cCCC  |     | End cCCC  |     | End cCCC  |  |
1357  | Object 0  |     | Object 1  |     | Object 2  |     | Object 3  |  |
1358  \-----------/     \-----------/     \-----------/     \-----------/  |
1359                                                                       |
1360                                                                    +--/
1361
1362                                                                    +--\
1363                                                                       |
1364  /-----------\     /-----------\     /-----------\     /-----------\  |
1365  | Begin cCCC|     | Begin cCCC|     | Begin cCCC|     | Begin cCCC|  |
1366  | Object  4 |     | Object  5 |     | Object  6 |     | Object  7 |  |
1367  +-----------+     +-----------+     +-----------+     +-----------+  |
1368  |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |  |
1369  |  unit 16  |     |  unit 17  |     |  unit 18  |     |  unit 19  |  |
1370  +-----------+     +-----------+     +-----------+     +-----------+  |
1371  |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |  +-\
1372  |  unit 20  |     |  unit 21  |     |  unit 22  |     |  unit 23  |    | Object
1373  +-----------+     +-----------+     +-----------+     +-----------+    +- Set
1374  |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |    |   2
1375  |  unit 24  |     |  unit 25  |     |  unit 26  |     |  unit 27  |  +-/
1376  +-----------+     +-----------+     +-----------+     +-----------+  |
1377  |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |  |
1378  |  unit 28  |     |  unit 29  |     |  unit 30  |     |  unit 31  |  |
1379  +-----------+     +-----------+     +-----------+     +-----------+  |
1380  | End cCCC  |     | End cCCC  |     | End cCCC  |     | End cCCC  |  |
1381  | Object 4  |     | Object 5  |     | Object 6  |     | Object 7  |  |
1382  \-----------/     \-----------/     \-----------/     \-----------/  |
1383                                                                       |
1384                                                                    +--/
1385
1386 Three important variables determine how Ceph stripes data:
1387
1388 - **Object Size:** Objects in the Ceph Storage Cluster have a maximum
1389   configurable size (e.g., 2MB, 4MB, etc.). The object size should be large
1390   enough to accommodate many stripe units, and should be a multiple of
1391   the stripe unit.
1392
1393 - **Stripe Width:** Stripes have a configurable unit size (e.g., 64kb).
1394   The Ceph Client divides the data it will write to objects into equally
1395   sized stripe units, except for the last stripe unit. A stripe width,
1396   should be a fraction of the Object Size so that an object may contain
1397   many stripe units.
1398
1399 - **Stripe Count:** The Ceph Client writes a sequence of stripe units
1400   over a series of objects determined by the stripe count. The series
1401   of objects is called an object set. After the Ceph Client writes to
1402   the last object in the object set, it returns to the first object in
1403   the object set.
1404
1405 .. important:: Test the performance of your striping configuration before
1406    putting your cluster into production. You CANNOT change these striping
1407    parameters after you stripe the data and write it to objects.
1408
1409 Once the Ceph Client has striped data to stripe units and mapped the stripe
1410 units to objects, Ceph's CRUSH algorithm maps the objects to placement groups,
1411 and the placement groups to Ceph OSD Daemons before the objects are stored as
1412 files on a storage disk.
1413
1414 .. note:: Since a client writes to a single pool, all data striped into objects
1415    get mapped to placement groups in the same pool. So they use the same CRUSH
1416    map and the same access controls.
1417
1418
1419 .. index:: architecture; Ceph Clients
1420
1421 Ceph Clients
1422 ============
1423
1424 Ceph Clients include a number of service interfaces. These include:
1425
1426 - **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service
1427   provides resizable, thin-provisioned block devices with snapshotting and
1428   cloning. Ceph stripes a block device across the cluster for high
1429   performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor
1430   that uses ``librbd`` directly--avoiding the kernel object overhead for
1431   virtualized systems.
1432
1433 - **Object Storage:** The :term:`Ceph Object Storage` (a.k.a., RGW) service
1434   provides RESTful APIs with interfaces that are compatible with Amazon S3
1435   and OpenStack Swift.
1436
1437 - **Filesystem**: The :term:`Ceph Filesystem` (CephFS) service provides
1438   a POSIX compliant filesystem usable with ``mount`` or as
1439   a filesytem in user space (FUSE).
1440
1441 Ceph can run additional instances of OSDs, MDSs, and monitors for scalability
1442 and high availability. The following diagram depicts the high-level
1443 architecture.
1444
1445 .. ditaa::
1446             +--------------+  +----------------+  +-------------+
1447             | Block Device |  | Object Storage |  |   Ceph FS   |
1448             +--------------+  +----------------+  +-------------+
1449
1450             +--------------+  +----------------+  +-------------+
1451             |    librbd    |  |     librgw     |  |  libcephfs  |
1452             +--------------+  +----------------+  +-------------+
1453
1454             +---------------------------------------------------+
1455             |      Ceph Storage Cluster Protocol (librados)     |
1456             +---------------------------------------------------+
1457
1458             +---------------+ +---------------+ +---------------+
1459             |      OSDs     | |      MDSs     | |    Monitors   |
1460             +---------------+ +---------------+ +---------------+
1461
1462
1463 .. index:: architecture; Ceph Object Storage
1464
1465 Ceph Object Storage
1466 -------------------
1467
1468 The Ceph Object Storage daemon, ``radosgw``, is a FastCGI service that provides
1469 a RESTful_ HTTP API to store objects and metadata. It layers on top of the Ceph
1470 Storage Cluster with its own data formats, and maintains its own user database,
1471 authentication, and access control. The RADOS Gateway uses a unified namespace,
1472 which means you can use either the OpenStack Swift-compatible API or the Amazon
1473 S3-compatible API. For example, you can write data using the S3-compatible API
1474 with one application and then read data using the Swift-compatible API with
1475 another application.
1476
1477 .. topic:: S3/Swift Objects and Store Cluster Objects Compared
1478
1479    Ceph's Object Storage uses the term *object* to describe the data it stores.
1480    S3 and Swift objects are not the same as the objects that Ceph writes to the
1481    Ceph Storage Cluster. Ceph Object Storage objects are mapped to Ceph Storage
1482    Cluster objects. The S3 and Swift objects do not necessarily
1483    correspond in a 1:1 manner with an object stored in the storage cluster. It
1484    is possible for an S3 or Swift object to map to multiple Ceph objects.
1485
1486 See `Ceph Object Storage`_ for details.
1487
1488
1489 .. index:: Ceph Block Device; block device; RBD; Rados Block Device
1490
1491 Ceph Block Device
1492 -----------------
1493
1494 A Ceph Block Device stripes a block device image over multiple objects in the
1495 Ceph Storage Cluster, where each object gets mapped to a placement group and
1496 distributed, and the placement groups are spread across separate ``ceph-osd``
1497 daemons throughout the cluster.
1498
1499 .. important:: Striping allows RBD block devices to perform better than a single
1500    server could!
1501
1502 Thin-provisioned snapshottable Ceph Block Devices are an attractive option for
1503 virtualization and cloud computing. In virtual machine scenarios, people
1504 typically deploy a Ceph Block Device with the ``rbd`` network storage driver in
1505 QEMU/KVM, where the host machine uses ``librbd`` to provide a block device
1506 service to the guest. Many cloud computing stacks use ``libvirt`` to integrate
1507 with hypervisors. You can use thin-provisioned Ceph Block Devices with QEMU and
1508 ``libvirt`` to support OpenStack and CloudStack among other solutions.
1509
1510 While we do not provide ``librbd`` support with other hypervisors at this time,
1511 you may also use Ceph Block Device kernel objects to provide a block device to a
1512 client. Other virtualization technologies such as Xen can access the Ceph Block
1513 Device kernel object(s). This is done with the  command-line tool ``rbd``.
1514
1515
1516 .. index:: Ceph FS; Ceph Filesystem; libcephfs; MDS; metadata server; ceph-mds
1517
1518 Ceph Filesystem
1519 ---------------
1520
1521 The Ceph Filesystem (Ceph FS) provides a POSIX-compliant filesystem as a
1522 service that is layered on top of the object-based Ceph Storage Cluster.
1523 Ceph FS files get mapped to objects that Ceph stores in the Ceph Storage
1524 Cluster. Ceph Clients mount a CephFS filesystem as a kernel object or as
1525 a Filesystem in User Space (FUSE).
1526
1527 .. ditaa::
1528             +-----------------------+  +------------------------+
1529             | CephFS Kernel Object  |  |      CephFS FUSE       |
1530             +-----------------------+  +------------------------+
1531
1532             +---------------------------------------------------+
1533             |            Ceph FS Library (libcephfs)            |
1534             +---------------------------------------------------+
1535
1536             +---------------------------------------------------+
1537             |      Ceph Storage Cluster Protocol (librados)     |
1538             +---------------------------------------------------+
1539
1540             +---------------+ +---------------+ +---------------+
1541             |      OSDs     | |      MDSs     | |    Monitors   |
1542             +---------------+ +---------------+ +---------------+
1543
1544
1545 The Ceph Filesystem service includes the Ceph Metadata Server (MDS) deployed
1546 with the Ceph Storage cluster. The purpose of the MDS is to store all the
1547 filesystem metadata (directories, file ownership, access modes, etc) in
1548 high-availability Ceph Metadata Servers where the metadata resides in memory.
1549 The reason for the MDS (a daemon called ``ceph-mds``) is that simple filesystem
1550 operations like listing a directory or changing a directory (``ls``, ``cd``)
1551 would tax the Ceph OSD Daemons unnecessarily. So separating the metadata from
1552 the data means that the Ceph Filesystem can provide high performance services
1553 without taxing the Ceph Storage Cluster.
1554
1555 Ceph FS separates the metadata from the data, storing the metadata in the MDS,
1556 and storing the file data in one or more objects in the Ceph Storage Cluster.
1557 The Ceph filesystem aims for POSIX compatibility. ``ceph-mds`` can run as a
1558 single process, or it can be distributed out to multiple physical machines,
1559 either for high availability or for scalability.
1560
1561 - **High Availability**: The extra ``ceph-mds`` instances can be `standby`,
1562   ready to take over the duties of any failed ``ceph-mds`` that was
1563   `active`. This is easy because all the data, including the journal, is
1564   stored on RADOS. The transition is triggered automatically by ``ceph-mon``.
1565
1566 - **Scalability**: Multiple ``ceph-mds`` instances can be `active`, and they
1567   will split the directory tree into subtrees (and shards of a single
1568   busy directory), effectively balancing the load amongst all `active`
1569   servers.
1570
1571 Combinations of `standby` and `active` etc are possible, for example
1572 running 3 `active` ``ceph-mds`` instances for scaling, and one `standby`
1573 instance for high availability.
1574
1575
1576
1577
1578 .. _RADOS - A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters: https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf
1579 .. _Paxos: http://en.wikipedia.org/wiki/Paxos_(computer_science)
1580 .. _Monitor Config Reference: ../rados/configuration/mon-config-ref
1581 .. _Monitoring OSDs and PGs: ../rados/operations/monitoring-osd-pg
1582 .. _Heartbeats: ../rados/configuration/mon-osd-interaction
1583 .. _Monitoring OSDs: ../rados/operations/monitoring-osd-pg/#monitoring-osds
1584 .. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
1585 .. _Data Scrubbing: ../rados/configuration/osd-config-ref#scrubbing
1586 .. _Report Peering Failure: ../rados/configuration/mon-osd-interaction#osds-report-peering-failure
1587 .. _Troubleshooting Peering Failure: ../rados/troubleshooting/troubleshooting-pg#placement-group-down-peering-failure
1588 .. _Ceph Authentication and Authorization: ../rados/operations/auth-intro/
1589 .. _Hardware Recommendations: ../start/hardware-recommendations
1590 .. _Network Config Reference: ../rados/configuration/network-config-ref
1591 .. _Data Scrubbing: ../rados/configuration/osd-config-ref#scrubbing
1592 .. _striping: http://en.wikipedia.org/wiki/Data_striping
1593 .. _RAID: http://en.wikipedia.org/wiki/RAID
1594 .. _RAID 0: http://en.wikipedia.org/wiki/RAID_0#RAID_0
1595 .. _Ceph Object Storage: ../radosgw/
1596 .. _RESTful: http://en.wikipedia.org/wiki/RESTful
1597 .. _Erasure Code Notes: https://github.com/ceph/ceph/blob/40059e12af88267d0da67d8fd8d9cd81244d8f93/doc/dev/osd_internals/erasure_coding/developer_notes.rst
1598 .. _Cache Tiering: ../rados/operations/cache-tiering
1599 .. _Set Pool Values: ../rados/operations/pools#set-pool-values
1600 .. _Kerberos: http://en.wikipedia.org/wiki/Kerberos_(protocol)
1601 .. _Cephx Config Guide: ../rados/configuration/auth-config-ref
1602 .. _User Management: ../rados/operations/user-management