==============
 Architecture
==============

:term:`Ceph` uniquely delivers **object, block, and file storage** in one
unified system. Ceph is highly reliable, easy to manage, and free. The power of
Ceph can transform your company's IT infrastructure and your ability to manage
vast amounts of data. Ceph delivers extraordinary scalability--thousands of
clients accessing petabytes to exabytes of data. A :term:`Ceph Node` leverages
commodity hardware and intelligent daemons, and a :term:`Ceph Storage Cluster`
accommodates large numbers of nodes, which communicate with each other to
replicate and redistribute data dynamically.

.. image:: images/stack.png


The Ceph Storage Cluster
========================

Ceph provides an infinitely scalable :term:`Ceph Storage Cluster` based upon
:abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, which you can read
about in `RADOS - A Scalable, Reliable Storage Service for Petabyte-scale
Storage Clusters`_.

A Ceph Storage Cluster consists of two types of daemons:

- :term:`Ceph Monitor`
- :term:`Ceph OSD Daemon`

.. ditaa::  +---------------+  +---------------+
            |      OSDs     |  |    Monitors   |
            +---------------+  +---------------+

A Ceph Monitor maintains a master copy of the cluster map. A cluster of Ceph
monitors ensures high availability should a monitor daemon fail. Storage cluster
clients retrieve a copy of the cluster map from the Ceph Monitor.

A Ceph OSD Daemon checks its own state and the state of other OSDs and reports
back to monitors.

Storage cluster clients and each :term:`Ceph OSD Daemon` use the CRUSH algorithm
to efficiently compute information about data location, instead of having to
depend on a central lookup table. Ceph's high-level features include providing a
native interface to the Ceph Storage Cluster via ``librados``, and a number of
service interfaces built on top of ``librados``.


Storing Data
------------

The Ceph Storage Cluster receives data from :term:`Ceph Clients`--whether it
comes through a :term:`Ceph Block Device`, :term:`Ceph Object Storage`, the
:term:`Ceph Filesystem` or a custom implementation you create using
``librados``--and it stores the data as objects. Each object corresponds to a
file in a filesystem, which is stored on an :term:`Object Storage Device`. Ceph
OSD Daemons handle the read/write operations on the storage disks.

.. ditaa:: /-----\       +-----+       +-----+
           | obj |------>| {d} |------>| {s} |
           \-----/       +-----+       +-----+

            Object         File         Disk

Ceph OSD Daemons store all data as objects in a flat namespace (e.g., no
hierarchy of directories). An object has an identifier, binary data, and
metadata consisting of a set of name/value pairs. The semantics are completely
up to :term:`Ceph Clients`. For example, CephFS uses metadata to store file
attributes such as the file owner, created date, last modified date, and so
forth.


.. 
ditaa:: /------+------------------------------+----------------\ - | ID | Binary Data | Metadata | - +------+------------------------------+----------------+ - | 1234 | 0101010101010100110101010010 | name1 = value1 | - | | 0101100001010100110101010010 | name2 = value2 | - | | 0101100001010100110101010010 | nameN = valueN | - \------+------------------------------+----------------/ - -.. note:: An object ID is unique across the entire cluster, not just the local - filesystem. - - -.. index:: architecture; high availability, scalability - -Scalability and High Availability ---------------------------------- - -In traditional architectures, clients talk to a centralized component (e.g., a -gateway, broker, API, facade, etc.), which acts as a single point of entry to a -complex subsystem. This imposes a limit to both performance and scalability, -while introducing a single point of failure (i.e., if the centralized component -goes down, the whole system goes down, too). - -Ceph eliminates the centralized gateway to enable clients to interact with -Ceph OSD Daemons directly. Ceph OSD Daemons create object replicas on other -Ceph Nodes to ensure data safety and high availability. Ceph also uses a cluster -of monitors to ensure high availability. To eliminate centralization, Ceph -uses an algorithm called CRUSH. - - -.. index:: CRUSH; architecture - -CRUSH Introduction -~~~~~~~~~~~~~~~~~~ - -Ceph Clients and Ceph OSD Daemons both use the :abbr:`CRUSH (Controlled -Replication Under Scalable Hashing)` algorithm to efficiently compute -information about object location, instead of having to depend on a -central lookup table. CRUSH provides a better data management mechanism compared -to older approaches, and enables massive scale by cleanly distributing the work -to all the clients and OSD daemons in the cluster. CRUSH uses intelligent data -replication to ensure resiliency, which is better suited to hyper-scale storage. -The following sections provide additional details on how CRUSH works. For a -detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable, Decentralized -Placement of Replicated Data`_. - -.. index:: architecture; cluster map - -Cluster Map -~~~~~~~~~~~ - -Ceph depends upon Ceph Clients and Ceph OSD Daemons having knowledge of the -cluster topology, which is inclusive of 5 maps collectively referred to as the -"Cluster Map": - -#. **The Monitor Map:** Contains the cluster ``fsid``, the position, name - address and port of each monitor. It also indicates the current epoch, - when the map was created, and the last time it changed. To view a monitor - map, execute ``ceph mon dump``. - -#. **The OSD Map:** Contains the cluster ``fsid``, when the map was created and - last modified, a list of pools, replica sizes, PG numbers, a list of OSDs - and their status (e.g., ``up``, ``in``). To view an OSD map, execute - ``ceph osd dump``. - -#. **The PG Map:** Contains the PG version, its time stamp, the last OSD - map epoch, the full ratios, and details on each placement group such as - the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (e.g., - ``active + clean``), and data usage statistics for each pool. - -#. **The CRUSH Map:** Contains a list of storage devices, the failure domain - hierarchy (e.g., device, host, rack, row, room, etc.), and rules for - traversing the hierarchy when storing data. To view a CRUSH map, execute - ``ceph osd getcrushmap -o {filename}``; then, decompile it by executing - ``crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}``. 
- You can view the decompiled map in a text editor or with ``cat``. - -#. **The MDS Map:** Contains the current MDS map epoch, when the map was - created, and the last time it changed. It also contains the pool for - storing metadata, a list of metadata servers, and which metadata servers - are ``up`` and ``in``. To view an MDS map, execute ``ceph fs dump``. - -Each map maintains an iterative history of its operating state changes. Ceph -Monitors maintain a master copy of the cluster map including the cluster -members, state, changes, and the overall health of the Ceph Storage Cluster. - -.. index:: high availability; monitor architecture - -High Availability Monitors -~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Before Ceph Clients can read or write data, they must contact a Ceph Monitor -to obtain the most recent copy of the cluster map. A Ceph Storage Cluster -can operate with a single monitor; however, this introduces a single -point of failure (i.e., if the monitor goes down, Ceph Clients cannot -read or write data). - -For added reliability and fault tolerance, Ceph supports a cluster of monitors. -In a cluster of monitors, latency and other faults can cause one or more -monitors to fall behind the current state of the cluster. For this reason, Ceph -must have agreement among various monitor instances regarding the state of the -cluster. Ceph always uses a majority of monitors (e.g., 1, 2:3, 3:5, 4:6, etc.) -and the `Paxos`_ algorithm to establish a consensus among the monitors about the -current state of the cluster. - -For details on configuring monitors, see the `Monitor Config Reference`_. - -.. index:: architecture; high availability authentication - -High Availability Authentication -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -To identify users and protect against man-in-the-middle attacks, Ceph provides -its ``cephx`` authentication system to authenticate users and daemons. - -.. note:: The ``cephx`` protocol does not address data encryption in transport - (e.g., SSL/TLS) or encryption at rest. - -Cephx uses shared secret keys for authentication, meaning both the client and -the monitor cluster have a copy of the client's secret key. The authentication -protocol is such that both parties are able to prove to each other they have a -copy of the key without actually revealing it. This provides mutual -authentication, which means the cluster is sure the user possesses the secret -key, and the user is sure that the cluster has a copy of the secret key. - -A key scalability feature of Ceph is to avoid a centralized interface to the -Ceph object store, which means that Ceph clients must be able to interact with -OSDs directly. To protect data, Ceph provides its ``cephx`` authentication -system, which authenticates users operating Ceph clients. The ``cephx`` protocol -operates in a manner with behavior similar to `Kerberos`_. - -A user/actor invokes a Ceph client to contact a monitor. Unlike Kerberos, each -monitor can authenticate users and distribute keys, so there is no single point -of failure or bottleneck when using ``cephx``. The monitor returns an -authentication data structure similar to a Kerberos ticket that contains a -session key for use in obtaining Ceph services. This session key is itself -encrypted with the user's permanent secret key, so that only the user can -request services from the Ceph Monitor(s). 
The client then uses the session key -to request its desired services from the monitor, and the monitor provides the -client with a ticket that will authenticate the client to the OSDs that actually -handle data. Ceph Monitors and OSDs share a secret, so the client can use the -ticket provided by the monitor with any OSD or metadata server in the cluster. -Like Kerberos, ``cephx`` tickets expire, so an attacker cannot use an expired -ticket or session key obtained surreptitiously. This form of authentication will -prevent attackers with access to the communications medium from either creating -bogus messages under another user's identity or altering another user's -legitimate messages, as long as the user's secret key is not divulged before it -expires. - -To use ``cephx``, an administrator must set up users first. In the following -diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from -the command line to generate a username and secret key. Ceph's ``auth`` -subsystem generates the username and key, stores a copy with the monitor(s) and -transmits the user's secret back to the ``client.admin`` user. This means that -the client and the monitor share a secret key. - -.. note:: The ``client.admin`` user must provide the user ID and - secret key to the user in a secure manner. - -.. ditaa:: +---------+ +---------+ - | Client | | Monitor | - +---------+ +---------+ - | request to | - | create a user | - |-------------->|----------+ create user - | | | and - |<--------------|<---------+ store key - | transmit key | - | | - - -To authenticate with the monitor, the client passes in the user name to the -monitor, and the monitor generates a session key and encrypts it with the secret -key associated to the user name. Then, the monitor transmits the encrypted -ticket back to the client. The client then decrypts the payload with the shared -secret key to retrieve the session key. The session key identifies the user for -the current session. The client then requests a ticket on behalf of the user -signed by the session key. The monitor generates a ticket, encrypts it with the -user's secret key and transmits it back to the client. The client decrypts the -ticket and uses it to sign requests to OSDs and metadata servers throughout the -cluster. - -.. ditaa:: +---------+ +---------+ - | Client | | Monitor | - +---------+ +---------+ - | authenticate | - |-------------->|----------+ generate and - | | | encrypt - |<--------------|<---------+ session key - | transmit | - | encrypted | - | session key | - | | - |-----+ decrypt | - | | session | - |<----+ key | - | | - | req. ticket | - |-------------->|----------+ generate and - | | | encrypt - |<--------------|<---------+ ticket - | recv. ticket | - | | - |-----+ decrypt | - | | ticket | - |<----+ | - - -The ``cephx`` protocol authenticates ongoing communications between the client -machine and the Ceph servers. Each message sent between a client and server, -subsequent to the initial authentication, is signed using a ticket that the -monitors, OSDs and metadata servers can verify with their shared secret. - -.. ditaa:: +---------+ +---------+ +-------+ +-------+ - | Client | | Monitor | | MDS | | OSD | - +---------+ +---------+ +-------+ +-------+ - | request to | | | - | create a user | | | - |-------------->| mon and | | - |<--------------| client share | | - | receive | a secret. 
| | - | shared secret | | | - | |<------------>| | - | |<-------------+------------>| - | | mon, mds, | | - | authenticate | and osd | | - |-------------->| share | | - |<--------------| a secret | | - | session key | | | - | | | | - | req. ticket | | | - |-------------->| | | - |<--------------| | | - | recv. ticket | | | - | | | | - | make request (CephFS only) | | - |----------------------------->| | - |<-----------------------------| | - | receive response (CephFS only) | - | | - | make request | - |------------------------------------------->| - |<-------------------------------------------| - receive response - -The protection offered by this authentication is between the Ceph client and the -Ceph server hosts. The authentication is not extended beyond the Ceph client. If -the user accesses the Ceph client from a remote host, Ceph authentication is not -applied to the connection between the user's host and the client host. - - -For configuration details, see `Cephx Config Guide`_. For user management -details, see `User Management`_. - - -.. index:: architecture; smart daemons and scalability - -Smart Daemons Enable Hyperscale -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -In many clustered architectures, the primary purpose of cluster membership is -so that a centralized interface knows which nodes it can access. Then the -centralized interface provides services to the client through a double -dispatch--which is a **huge** bottleneck at the petabyte-to-exabyte scale. - -Ceph eliminates the bottleneck: Ceph's OSD Daemons AND Ceph Clients are cluster -aware. Like Ceph clients, each Ceph OSD Daemon knows about other Ceph OSD -Daemons in the cluster. This enables Ceph OSD Daemons to interact directly with -other Ceph OSD Daemons and Ceph Monitors. Additionally, it enables Ceph Clients -to interact directly with Ceph OSD Daemons. - -The ability of Ceph Clients, Ceph Monitors and Ceph OSD Daemons to interact with -each other means that Ceph OSD Daemons can utilize the CPU and RAM of the Ceph -nodes to easily perform tasks that would bog down a centralized server. The -ability to leverage this computing power leads to several major benefits: - -#. **OSDs Service Clients Directly:** Since any network device has a limit to - the number of concurrent connections it can support, a centralized system - has a low physical limit at high scales. By enabling Ceph Clients to contact - Ceph OSD Daemons directly, Ceph increases both performance and total system - capacity simultaneously, while removing a single point of failure. Ceph - Clients can maintain a session when they need to, and with a particular Ceph - OSD Daemon instead of a centralized server. - -#. **OSD Membership and Status**: Ceph OSD Daemons join a cluster and report - on their status. At the lowest level, the Ceph OSD Daemon status is ``up`` - or ``down`` reflecting whether or not it is running and able to service - Ceph Client requests. If a Ceph OSD Daemon is ``down`` and ``in`` the Ceph - Storage Cluster, this status may indicate the failure of the Ceph OSD - Daemon. If a Ceph OSD Daemon is not running (e.g., it crashes), the Ceph OSD - Daemon cannot notify the Ceph Monitor that it is ``down``. The OSDs - periodically send messages to the Ceph Monitor (``MPGStats`` pre-luminous, - and a new ``MOSDBeacon`` in luminous). If the Ceph Monitor doesn't see that - message after a configurable period of time then it marks the OSD down. - This mechanism is a failsafe, however. 
Normally, Ceph OSD Daemons will - determine if a neighboring OSD is down and report it to the Ceph Monitor(s). - This assures that Ceph Monitors are lightweight processes. See `Monitoring - OSDs`_ and `Heartbeats`_ for additional details. - -#. **Data Scrubbing:** As part of maintaining data consistency and cleanliness, - Ceph OSD Daemons can scrub objects within placement groups. That is, Ceph - OSD Daemons can compare object metadata in one placement group with its - replicas in placement groups stored on other OSDs. Scrubbing (usually - performed daily) catches bugs or filesystem errors. Ceph OSD Daemons also - perform deeper scrubbing by comparing data in objects bit-for-bit. Deep - scrubbing (usually performed weekly) finds bad sectors on a drive that - weren't apparent in a light scrub. See `Data Scrubbing`_ for details on - configuring scrubbing. - -#. **Replication:** Like Ceph Clients, Ceph OSD Daemons use the CRUSH - algorithm, but the Ceph OSD Daemon uses it to compute where replicas of - objects should be stored (and for rebalancing). In a typical write scenario, - a client uses the CRUSH algorithm to compute where to store an object, maps - the object to a pool and placement group, then looks at the CRUSH map to - identify the primary OSD for the placement group. - - The client writes the object to the identified placement group in the - primary OSD. Then, the primary OSD with its own copy of the CRUSH map - identifies the secondary and tertiary OSDs for replication purposes, and - replicates the object to the appropriate placement groups in the secondary - and tertiary OSDs (as many OSDs as additional replicas), and responds to the - client once it has confirmed the object was stored successfully. - -.. ditaa:: - +----------+ - | Client | - | | - +----------+ - * ^ - Write (1) | | Ack (6) - | | - v * - +-------------+ - | Primary OSD | - | | - +-------------+ - * ^ ^ * - Write (2) | | | | Write (3) - +------+ | | +------+ - | +------+ +------+ | - | | Ack (4) Ack (5)| | - v * * v - +---------------+ +---------------+ - | Secondary OSD | | Tertiary OSD | - | | | | - +---------------+ +---------------+ - -With the ability to perform data replication, Ceph OSD Daemons relieve Ceph -clients from that duty, while ensuring high data availability and data safety. - - -Dynamic Cluster Management --------------------------- - -In the `Scalability and High Availability`_ section, we explained how Ceph uses -CRUSH, cluster awareness and intelligent daemons to scale and maintain high -availability. Key to Ceph's design is the autonomous, self-healing, and -intelligent Ceph OSD Daemon. Let's take a deeper look at how CRUSH works to -enable modern cloud storage infrastructures to place data, rebalance the cluster -and recover from faults dynamically. - -.. index:: architecture; pools - -About Pools -~~~~~~~~~~~ - -The Ceph storage system supports the notion of 'Pools', which are logical -partitions for storing objects. - -Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write objects to -pools. The pool's ``size`` or number of replicas, the CRUSH ruleset and the -number of placement groups determine how Ceph will place the data. - -.. 
ditaa:: - +--------+ Retrieves +---------------+ - | Client |------------>| Cluster Map | - +--------+ +---------------+ - | - v Writes - /-----\ - | obj | - \-----/ - | To - v - +--------+ +---------------+ - | Pool |---------->| CRUSH Ruleset | - +--------+ Selects +---------------+ - - -Pools set at least the following parameters: - -- Ownership/Access to Objects -- The Number of Placement Groups, and -- The CRUSH Ruleset to Use. - -See `Set Pool Values`_ for details. - - -.. index: architecture; placement group mapping - -Mapping PGs to OSDs -~~~~~~~~~~~~~~~~~~~ - -Each pool has a number of placement groups. CRUSH maps PGs to OSDs dynamically. -When a Ceph Client stores objects, CRUSH will map each object to a placement -group. - -Mapping objects to placement groups creates a layer of indirection between the -Ceph OSD Daemon and the Ceph Client. The Ceph Storage Cluster must be able to -grow (or shrink) and rebalance where it stores objects dynamically. If the Ceph -Client "knew" which Ceph OSD Daemon had which object, that would create a tight -coupling between the Ceph Client and the Ceph OSD Daemon. Instead, the CRUSH -algorithm maps each object to a placement group and then maps each placement -group to one or more Ceph OSD Daemons. This layer of indirection allows Ceph to -rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices -come online. The following diagram depicts how CRUSH maps objects to placement -groups, and placement groups to OSDs. - -.. ditaa:: - /-----\ /-----\ /-----\ /-----\ /-----\ - | obj | | obj | | obj | | obj | | obj | - \-----/ \-----/ \-----/ \-----/ \-----/ - | | | | | - +--------+--------+ +---+----+ - | | - v v - +-----------------------+ +-----------------------+ - | Placement Group #1 | | Placement Group #2 | - | | | | - +-----------------------+ +-----------------------+ - | | - | +-----------------------+---+ - +------+------+-------------+ | - | | | | - v v v v - /----------\ /----------\ /----------\ /----------\ - | | | | | | | | - | OSD #1 | | OSD #2 | | OSD #3 | | OSD #4 | - | | | | | | | | - \----------/ \----------/ \----------/ \----------/ - -With a copy of the cluster map and the CRUSH algorithm, the client can compute -exactly which OSD to use when reading or writing a particular object. - -.. index:: architecture; calculating PG IDs - -Calculating PG IDs -~~~~~~~~~~~~~~~~~~ - -When a Ceph Client binds to a Ceph Monitor, it retrieves the latest copy of the -`Cluster Map`_. With the cluster map, the client knows about all of the monitors, -OSDs, and metadata servers in the cluster. **However, it doesn't know anything -about object locations.** - -.. epigraph:: - - Object locations get computed. - - -The only input required by the client is the object ID and the pool. -It's simple: Ceph stores data in named pools (e.g., "liverpool"). When a client -wants to store a named object (e.g., "john," "paul," "george," "ringo", etc.) -it calculates a placement group using the object name, a hash code, the -number of PGs in the pool and the pool name. Ceph clients use the following -steps to compute PG IDs. - -#. The client inputs the pool ID and the object ID. (e.g., pool = "liverpool" - and object-id = "john") -#. Ceph takes the object ID and hashes it. -#. Ceph calculates the hash modulo the number of PGs. (e.g., ``58``) to get - a PG ID. -#. Ceph gets the pool ID given the pool name (e.g., "liverpool" = ``4``) -#. Ceph prepends the pool ID to the PG ID (e.g., ``4.58``). 
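The following sketch mirrors those steps. It is only an illustration: the hash
function (CRC32 here), the pool ID and the PG count are stand-in assumptions,
whereas Ceph itself hashes object names with its own hash function and uses a
stable-mod placement rather than a plain modulo::

    import zlib

    def compute_pg_id(pool_id, pg_num, object_name):
        """Map an object name to a '<pool-id>.<pg-id>' string (simplified)."""
        obj_hash = zlib.crc32(object_name.encode())  # stand-in for Ceph's object-name hash
        pg = obj_hash % pg_num                       # hash modulo the number of PGs
        return "{0}.{1:x}".format(pool_id, pg)       # prepend the pool ID to the PG ID

    # e.g., pool "liverpool" with pool ID 4 and 128 PGs, object "john"
    print(compute_pg_id(4, 128, "john"))             # prints a PG ID of the form '4.xx'

On a running cluster, ``ceph osd map {pool-name} {object-name}`` reports the
placement group (and acting set) that Ceph actually computes for an object.
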
Computing object locations is much faster than performing an object location
query over a chatty session. The :abbr:`CRUSH (Controlled Replication Under
Scalable Hashing)` algorithm allows a client to compute where objects *should*
be stored, and enables the client to contact the primary OSD to store or
retrieve the objects.

.. index:: architecture; PG Peering

Peering and Sets
~~~~~~~~~~~~~~~~

In previous sections, we noted that Ceph OSD Daemons check each other's
heartbeats and report back to the Ceph Monitor. Ceph OSD Daemons also perform
'peering', which is the process of bringing all of the OSDs that store a
Placement Group (PG) into agreement about the state of all of the objects (and
their metadata) in that PG. In fact, Ceph OSD Daemons `Report Peering Failure`_
to the Ceph Monitors. Peering issues usually resolve themselves; however, if
the problem persists, you may need to refer to the `Troubleshooting Peering
Failure`_ section.

.. Note:: Agreeing on the state does not mean that the PGs have the latest contents.

The Ceph Storage Cluster was designed to store at least two copies of an object
(i.e., ``size = 2``), which is the minimum requirement for data safety. For high
availability, a Ceph Storage Cluster should store more than two copies of an
object (e.g., ``size = 3`` and ``min size = 2``) so that it can continue to run
in a ``degraded`` state while maintaining data safety.

Referring back to the diagram in `Smart Daemons Enable Hyperscale`_, we do not
name the Ceph OSD Daemons specifically (e.g., ``osd.0``, ``osd.1``, etc.), but
rather refer to them as *Primary*, *Secondary*, and so forth. By convention,
the *Primary* is the first OSD in the *Acting Set*. It is responsible for
coordinating the peering process for each placement group where it acts as the
*Primary*, and it is the **ONLY** OSD that will accept client-initiated writes
to objects for a given placement group where it acts as the *Primary*.

We refer to the series of OSDs responsible for a placement group as an *Acting
Set*. An *Acting Set* may refer to the Ceph OSD Daemons that are currently
responsible for the placement group, or the Ceph OSD Daemons that were
responsible for a particular placement group as of some epoch.

The Ceph OSD daemons that are part of an *Acting Set* may not always be ``up``.
When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*. The *Up
Set* is an important distinction, because Ceph can remap PGs to other Ceph OSD
Daemons when an OSD fails.

.. note:: In an *Acting Set* for a PG containing ``osd.25``, ``osd.32`` and
   ``osd.61``, the first OSD, ``osd.25``, is the *Primary*. If that OSD fails,
   the Secondary, ``osd.32``, becomes the *Primary*, and ``osd.25`` will be
   removed from the *Up Set*.


.. index:: architecture; Rebalancing

Rebalancing
~~~~~~~~~~~

When you add a Ceph OSD Daemon to a Ceph Storage Cluster, the cluster map gets
updated with the new OSD. Referring back to `Calculating PG IDs`_, this changes
the cluster map. Consequently, it changes object placement, because it changes
an input for the calculations. The following diagram depicts the rebalancing
process (albeit rather crudely, since it is substantially less impactful with
large clusters) where some, but not all, of the PGs migrate from existing OSDs
(OSD 1 and OSD 2) to the new OSD (OSD 3). Even when rebalancing, CRUSH is
stable. 
Many of the placement groups remain in their original configuration, -and each OSD gets some added capacity, so there are no load spikes on the -new OSD after rebalancing is complete. - - -.. ditaa:: - +--------+ +--------+ - Before | OSD 1 | | OSD 2 | - +--------+ +--------+ - | PG #1 | | PG #6 | - | PG #2 | | PG #7 | - | PG #3 | | PG #8 | - | PG #4 | | PG #9 | - | PG #5 | | PG #10 | - +--------+ +--------+ - - +--------+ +--------+ +--------+ - After | OSD 1 | | OSD 2 | | OSD 3 | - +--------+ +--------+ +--------+ - | PG #1 | | PG #7 | | PG #3 | - | PG #2 | | PG #8 | | PG #6 | - | PG #4 | | PG #10 | | PG #9 | - | PG #5 | | | | | - | | | | | | - +--------+ +--------+ +--------+ - - -.. index:: architecture; Data Scrubbing - -Data Consistency -~~~~~~~~~~~~~~~~ - -As part of maintaining data consistency and cleanliness, Ceph OSDs can also -scrub objects within placement groups. That is, Ceph OSDs can compare object -metadata in one placement group with its replicas in placement groups stored in -other OSDs. Scrubbing (usually performed daily) catches OSD bugs or filesystem -errors. OSDs can also perform deeper scrubbing by comparing data in objects -bit-for-bit. Deep scrubbing (usually performed weekly) finds bad sectors on a -disk that weren't apparent in a light scrub. - -See `Data Scrubbing`_ for details on configuring scrubbing. - - - - - -.. index:: erasure coding - -Erasure Coding --------------- - -An erasure coded pool stores each object as ``K+M`` chunks. It is divided into -``K`` data chunks and ``M`` coding chunks. The pool is configured to have a size -of ``K+M`` so that each chunk is stored in an OSD in the acting set. The rank of -the chunk is stored as an attribute of the object. - -For instance an erasure coded pool is created to use five OSDs (``K+M = 5``) and -sustain the loss of two of them (``M = 2``). - -Reading and Writing Encoded Chunks -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -When the object **NYAN** containing ``ABCDEFGHI`` is written to the pool, the erasure -encoding function splits the content into three data chunks simply by dividing -the content in three: the first contains ``ABC``, the second ``DEF`` and the -last ``GHI``. The content will be padded if the content length is not a multiple -of ``K``. The function also creates two coding chunks: the fourth with ``YXY`` -and the fifth with ``GQC``. Each chunk is stored in an OSD in the acting set. -The chunks are stored in objects that have the same name (**NYAN**) but reside -on different OSDs. The order in which the chunks were created must be preserved -and is stored as an attribute of the object (``shard_t``), in addition to its -name. Chunk 1 contains ``ABC`` and is stored on **OSD5** while chunk 4 contains -``YXY`` and is stored on **OSD3**. - - -.. 
ditaa:: - +-------------------+ - name | NYAN | - +-------------------+ - content | ABCDEFGHI | - +--------+----------+ - | - | - v - +------+------+ - +---------------+ encode(3,2) +-----------+ - | +--+--+---+---+ | - | | | | | - | +-------+ | +-----+ | - | | | | | - +--v---+ +--v---+ +--v---+ +--v---+ +--v---+ - name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN | - +------+ +------+ +------+ +------+ +------+ - shard | 1 | | 2 | | 3 | | 4 | | 5 | - +------+ +------+ +------+ +------+ +------+ - content | ABC | | DEF | | GHI | | YXY | | QGC | - +--+---+ +--+---+ +--+---+ +--+---+ +--+---+ - | | | | | - | | v | | - | | +--+---+ | | - | | | OSD1 | | | - | | +------+ | | - | | | | - | | +------+ | | - | +------>| OSD2 | | | - | +------+ | | - | | | - | +------+ | | - | | OSD3 |<----+ | - | +------+ | - | | - | +------+ | - | | OSD4 |<--------------+ - | +------+ - | - | +------+ - +----------------->| OSD5 | - +------+ - - -When the object **NYAN** is read from the erasure coded pool, the decoding -function reads three chunks: chunk 1 containing ``ABC``, chunk 3 containing -``GHI`` and chunk 4 containing ``YXY``. Then, it rebuilds the original content -of the object ``ABCDEFGHI``. The decoding function is informed that the chunks 2 -and 5 are missing (they are called 'erasures'). The chunk 5 could not be read -because the **OSD4** is out. The decoding function can be called as soon as -three chunks are read: **OSD2** was the slowest and its chunk was not taken into -account. - -.. ditaa:: - +-------------------+ - name | NYAN | - +-------------------+ - content | ABCDEFGHI | - +---------+---------+ - ^ - | - | - +-------+-------+ - | decode(3,2) | - +------------->+ erasures 2,5 +<-+ - | | | | - | +-------+-------+ | - | ^ | - | | | - | | | - +--+---+ +------+ +---+--+ +---+--+ - name | NYAN | | NYAN | | NYAN | | NYAN | - +------+ +------+ +------+ +------+ - shard | 1 | | 2 | | 3 | | 4 | - +------+ +------+ +------+ +------+ - content | ABC | | DEF | | GHI | | YXY | - +--+---+ +--+---+ +--+---+ +--+---+ - ^ . ^ ^ - | TOO . | | - | SLOW . +--+---+ | - | ^ | OSD1 | | - | | +------+ | - | | | - | | +------+ | - | +-------| OSD2 | | - | +------+ | - | | - | +------+ | - | | OSD3 |------+ - | +------+ - | - | +------+ - | | OSD4 | OUT - | +------+ - | - | +------+ - +------------------| OSD5 | - +------+ - - -Interrupted Full Writes -~~~~~~~~~~~~~~~~~~~~~~~ - -In an erasure coded pool, the primary OSD in the up set receives all write -operations. It is responsible for encoding the payload into ``K+M`` chunks and -sends them to the other OSDs. It is also responsible for maintaining an -authoritative version of the placement group logs. - -In the following diagram, an erasure coded placement group has been created with -``K = 2 + M = 1`` and is supported by three OSDs, two for ``K`` and one for -``M``. The acting set of the placement group is made of **OSD 1**, **OSD 2** and -**OSD 3**. An object has been encoded and stored in the OSDs : the chunk -``D1v1`` (i.e. Data chunk number 1, version 1) is on **OSD 1**, ``D2v1`` on -**OSD 2** and ``C1v1`` (i.e. Coding chunk number 1, version 1) on **OSD 3**. The -placement group logs on each OSD are identical (i.e. ``1,1`` for epoch 1, -version 1). - - -.. 
ditaa:: - Primary OSD - - +-------------+ - | OSD 1 | +-------------+ - | log | Write Full | | - | +----+ |<------------+ Ceph Client | - | |D1v1| 1,1 | v1 | | - | +----+ | +-------------+ - +------+------+ - | - | - | +-------------+ - | | OSD 2 | - | | log | - +--------->+ +----+ | - | | |D2v1| 1,1 | - | | +----+ | - | +-------------+ - | - | +-------------+ - | | OSD 3 | - | | log | - +--------->| +----+ | - | |C1v1| 1,1 | - | +----+ | - +-------------+ - -**OSD 1** is the primary and receives a **WRITE FULL** from a client, which -means the payload is to replace the object entirely instead of overwriting a -portion of it. Version 2 (v2) of the object is created to override version 1 -(v1). **OSD 1** encodes the payload into three chunks: ``D1v2`` (i.e. Data -chunk number 1 version 2) will be on **OSD 1**, ``D2v2`` on **OSD 2** and -``C1v2`` (i.e. Coding chunk number 1 version 2) on **OSD 3**. Each chunk is sent -to the target OSD, including the primary OSD which is responsible for storing -chunks in addition to handling write operations and maintaining an authoritative -version of the placement group logs. When an OSD receives the message -instructing it to write the chunk, it also creates a new entry in the placement -group logs to reflect the change. For instance, as soon as **OSD 3** stores -``C1v2``, it adds the entry ``1,2`` ( i.e. epoch 1, version 2 ) to its logs. -Because the OSDs work asynchronously, some chunks may still be in flight ( such -as ``D2v2`` ) while others are acknowledged and on disk ( such as ``C1v1`` and -``D1v1``). - -.. ditaa:: - - Primary OSD - - +-------------+ - | OSD 1 | - | log | - | +----+ | +-------------+ - | |D1v2| 1,2 | Write Full | | - | +----+ +<------------+ Ceph Client | - | | v2 | | - | +----+ | +-------------+ - | |D1v1| 1,1 | - | +----+ | - +------+------+ - | - | - | +------+------+ - | | OSD 2 | - | +------+ | log | - +->| D2v2 | | +----+ | - | +------+ | |D2v1| 1,1 | - | | +----+ | - | +-------------+ - | - | +-------------+ - | | OSD 3 | - | | log | - | | +----+ | - | | |C1v2| 1,2 | - +---------->+ +----+ | - | | - | +----+ | - | |C1v1| 1,1 | - | +----+ | - +-------------+ - - -If all goes well, the chunks are acknowledged on each OSD in the acting set and -the logs' ``last_complete`` pointer can move from ``1,1`` to ``1,2``. - -.. ditaa:: - - Primary OSD - - +-------------+ - | OSD 1 | - | log | - | +----+ | +-------------+ - | |D1v2| 1,2 | Write Full | | - | +----+ +<------------+ Ceph Client | - | | v2 | | - | +----+ | +-------------+ - | |D1v1| 1,1 | - | +----+ | - +------+------+ - | - | +-------------+ - | | OSD 2 | - | | log | - | | +----+ | - | | |D2v2| 1,2 | - +---------->+ +----+ | - | | | - | | +----+ | - | | |D2v1| 1,1 | - | | +----+ | - | +-------------+ - | - | +-------------+ - | | OSD 3 | - | | log | - | | +----+ | - | | |C1v2| 1,2 | - +---------->+ +----+ | - | | - | +----+ | - | |C1v1| 1,1 | - | +----+ | - +-------------+ - - -Finally, the files used to store the chunks of the previous version of the -object can be removed: ``D1v1`` on **OSD 1**, ``D2v1`` on **OSD 2** and ``C1v1`` -on **OSD 3**. - -.. ditaa:: - Primary OSD - - +-------------+ - | OSD 1 | - | log | - | +----+ | - | |D1v2| 1,2 | - | +----+ | - +------+------+ - | - | - | +-------------+ - | | OSD 2 | - | | log | - +--------->+ +----+ | - | | |D2v2| 1,2 | - | | +----+ | - | +-------------+ - | - | +-------------+ - | | OSD 3 | - | | log | - +--------->| +----+ | - | |C1v2| 1,2 | - | +----+ | - +-------------+ - - -But accidents happen. 
If **OSD 1** goes down while ``D2v2`` is still in flight, -the object's version 2 is partially written: **OSD 3** has one chunk but that is -not enough to recover. It lost two chunks: ``D1v2`` and ``D2v2`` and the -erasure coding parameters ``K = 2``, ``M = 1`` require that at least two chunks are -available to rebuild the third. **OSD 4** becomes the new primary and finds that -the ``last_complete`` log entry (i.e., all objects before this entry were known -to be available on all OSDs in the previous acting set ) is ``1,1`` and that -will be the head of the new authoritative log. - -.. ditaa:: - +-------------+ - | OSD 1 | - | (down) | - | c333 | - +------+------+ - | - | +-------------+ - | | OSD 2 | - | | log | - | | +----+ | - +---------->+ |D2v1| 1,1 | - | | +----+ | - | | | - | +-------------+ - | - | +-------------+ - | | OSD 3 | - | | log | - | | +----+ | - | | |C1v2| 1,2 | - +---------->+ +----+ | - | | - | +----+ | - | |C1v1| 1,1 | - | +----+ | - +-------------+ - Primary OSD - +-------------+ - | OSD 4 | - | log | - | | - | 1,1 | - | | - +------+------+ - - - -The log entry 1,2 found on **OSD 3** is divergent from the new authoritative log -provided by **OSD 4**: it is discarded and the file containing the ``C1v2`` -chunk is removed. The ``D1v1`` chunk is rebuilt with the ``decode`` function of -the erasure coding library during scrubbing and stored on the new primary -**OSD 4**. - - -.. ditaa:: - Primary OSD - - +-------------+ - | OSD 4 | - | log | - | +----+ | - | |D1v1| 1,1 | - | +----+ | - +------+------+ - ^ - | - | +-------------+ - | | OSD 2 | - | | log | - +----------+ +----+ | - | | |D2v1| 1,1 | - | | +----+ | - | +-------------+ - | - | +-------------+ - | | OSD 3 | - | | log | - +----------| +----+ | - | |C1v1| 1,1 | - | +----+ | - +-------------+ - - +-------------+ - | OSD 1 | - | (down) | - | c333 | - +-------------+ - -See `Erasure Code Notes`_ for additional details. - - - -Cache Tiering -------------- - -A cache tier provides Ceph Clients with better I/O performance for a subset of -the data stored in a backing storage tier. Cache tiering involves creating a -pool of relatively fast/expensive storage devices (e.g., solid state drives) -configured to act as a cache tier, and a backing pool of either erasure-coded -or relatively slower/cheaper devices configured to act as an economical storage -tier. The Ceph objecter handles where to place the objects and the tiering -agent determines when to flush objects from the cache to the backing storage -tier. So the cache tier and the backing storage tier are completely transparent -to Ceph clients. - - -.. ditaa:: - +-------------+ - | Ceph Client | - +------+------+ - ^ - Tiering is | - Transparent | Faster I/O - to Ceph | +---------------+ - Client Ops | | | - | +----->+ Cache Tier | - | | | | - | | +-----+---+-----+ - | | | ^ - v v | | Active Data in Cache Tier - +------+----+--+ | | - | Objecter | | | - +-----------+--+ | | - ^ | | Inactive Data in Storage Tier - | v | - | +-----+---+-----+ - | | | - +----->| Storage Tier | - | | - +---------------+ - Slower I/O - -See `Cache Tiering`_ for additional details. - - -.. index:: Extensibility, Ceph Classes - -Extending Ceph --------------- - -You can extend Ceph by creating shared object classes called 'Ceph Classes'. -Ceph loads ``.so`` classes stored in the ``osd class dir`` directory dynamically -(i.e., ``$libdir/rados-classes`` by default). 
When you implement a class, you -can create new object methods that have the ability to call the native methods -in the Ceph Object Store, or other class methods you incorporate via libraries -or create yourself. - -On writes, Ceph Classes can call native or class methods, perform any series of -operations on the inbound data and generate a resulting write transaction that -Ceph will apply atomically. - -On reads, Ceph Classes can call native or class methods, perform any series of -operations on the outbound data and return the data to the client. - -.. topic:: Ceph Class Example - - A Ceph class for a content management system that presents pictures of a - particular size and aspect ratio could take an inbound bitmap image, crop it - to a particular aspect ratio, resize it and embed an invisible copyright or - watermark to help protect the intellectual property; then, save the - resulting bitmap image to the object store. - -See ``src/objclass/objclass.h``, ``src/fooclass.cc`` and ``src/barclass`` for -exemplary implementations. - - -Summary -------- - -Ceph Storage Clusters are dynamic--like a living organism. Whereas, many storage -appliances do not fully utilize the CPU and RAM of a typical commodity server, -Ceph does. From heartbeats, to peering, to rebalancing the cluster or -recovering from faults, Ceph offloads work from clients (and from a centralized -gateway which doesn't exist in the Ceph architecture) and uses the computing -power of the OSDs to perform the work. When referring to `Hardware -Recommendations`_ and the `Network Config Reference`_, be cognizant of the -foregoing concepts to understand how Ceph utilizes computing resources. - -.. index:: Ceph Protocol, librados - -Ceph Protocol -============= - -Ceph Clients use the native protocol for interacting with the Ceph Storage -Cluster. Ceph packages this functionality into the ``librados`` library so that -you can create your own custom Ceph Clients. The following diagram depicts the -basic architecture. - -.. ditaa:: - +---------------------------------+ - | Ceph Storage Cluster Protocol | - | (librados) | - +---------------------------------+ - +---------------+ +---------------+ - | OSDs | | Monitors | - +---------------+ +---------------+ - - -Native Protocol and ``librados`` --------------------------------- - -Modern applications need a simple object storage interface with asynchronous -communication capability. The Ceph Storage Cluster provides a simple object -storage interface with asynchronous communication capability. The interface -provides direct, parallel access to objects throughout the cluster. - - -- Pool Operations -- Snapshots and Copy-on-write Cloning -- Read/Write Objects - - Create or Remove - - Entire Object or Byte Range - - Append or Truncate -- Create/Set/Get/Remove XATTRs -- Create/Set/Get/Remove Key/Value Pairs -- Compound operations and dual-ack semantics -- Object Classes - - -.. index:: architecture; watch/notify - -Object Watch/Notify -------------------- - -A client can register a persistent interest with an object and keep a session to -the primary OSD open. The client can send a notification message and a payload to -all watchers and receive notification when the watchers receive the -notification. This enables a client to use any object as a -synchronization/communication channel. - - -.. 
ditaa:: +----------+ +----------+ +----------+ +---------------+ - | Client 1 | | Client 2 | | Client 3 | | OSD:Object ID | - +----------+ +----------+ +----------+ +---------------+ - | | | | - | | | | - | | Watch Object | | - |--------------------------------------------------->| - | | | | - |<---------------------------------------------------| - | | Ack/Commit | | - | | | | - | | Watch Object | | - | |---------------------------------->| - | | | | - | |<----------------------------------| - | | Ack/Commit | | - | | | Watch Object | - | | |----------------->| - | | | | - | | |<-----------------| - | | | Ack/Commit | - | | Notify | | - |--------------------------------------------------->| - | | | | - |<---------------------------------------------------| - | | Notify | | - | | | | - | |<----------------------------------| - | | Notify | | - | | |<-----------------| - | | | Notify | - | | Ack | | - |----------------+---------------------------------->| - | | | | - | | Ack | | - | +---------------------------------->| - | | | | - | | | Ack | - | | |----------------->| - | | | | - |<---------------+----------------+------------------| - | Complete - -.. index:: architecture; Striping - -Data Striping -------------- - -Storage devices have throughput limitations, which impact performance and -scalability. So storage systems often support `striping`_--storing sequential -pieces of information across multiple storage devices--to increase throughput -and performance. The most common form of data striping comes from `RAID`_. -The RAID type most similar to Ceph's striping is `RAID 0`_, or a 'striped -volume'. Ceph's striping offers the throughput of RAID 0 striping, the -reliability of n-way RAID mirroring and faster recovery. - -Ceph provides three types of clients: Ceph Block Device, Ceph Filesystem, and -Ceph Object Storage. A Ceph Client converts its data from the representation -format it provides to its users (a block device image, RESTful objects, CephFS -filesystem directories) into objects for storage in the Ceph Storage Cluster. - -.. tip:: The objects Ceph stores in the Ceph Storage Cluster are not striped. - Ceph Object Storage, Ceph Block Device, and the Ceph Filesystem stripe their - data over multiple Ceph Storage Cluster objects. Ceph Clients that write - directly to the Ceph Storage Cluster via ``librados`` must perform the - striping (and parallel I/O) for themselves to obtain these benefits. - -The simplest Ceph striping format involves a stripe count of 1 object. Ceph -Clients write stripe units to a Ceph Storage Cluster object until the object is -at its maximum capacity, and then create another object for additional stripes -of data. The simplest form of striping may be sufficient for small block device -images, S3 or Swift objects and CephFS files. However, this simple form doesn't -take maximum advantage of Ceph's ability to distribute data across placement -groups, and consequently doesn't improve performance very much. The following -diagram depicts the simplest form of striping: - -.. 
ditaa:: - +---------------+ - | Client Data | - | Format | - | cCCC | - +---------------+ - | - +--------+-------+ - | | - v v - /-----------\ /-----------\ - | Begin cCCC| | Begin cCCC| - | Object 0 | | Object 1 | - +-----------+ +-----------+ - | stripe | | stripe | - | unit 1 | | unit 5 | - +-----------+ +-----------+ - | stripe | | stripe | - | unit 2 | | unit 6 | - +-----------+ +-----------+ - | stripe | | stripe | - | unit 3 | | unit 7 | - +-----------+ +-----------+ - | stripe | | stripe | - | unit 4 | | unit 8 | - +-----------+ +-----------+ - | End cCCC | | End cCCC | - | Object 0 | | Object 1 | - \-----------/ \-----------/ - - -If you anticipate large images sizes, large S3 or Swift objects (e.g., video), -or large CephFS directories, you may see considerable read/write performance -improvements by striping client data over multiple objects within an object set. -Significant write performance occurs when the client writes the stripe units to -their corresponding objects in parallel. Since objects get mapped to different -placement groups and further mapped to different OSDs, each write occurs in -parallel at the maximum write speed. A write to a single disk would be limited -by the head movement (e.g. 6ms per seek) and bandwidth of that one device (e.g. -100MB/s). By spreading that write over multiple objects (which map to different -placement groups and OSDs) Ceph can reduce the number of seeks per drive and -combine the throughput of multiple drives to achieve much faster write (or read) -speeds. - -.. note:: Striping is independent of object replicas. Since CRUSH - replicates objects across OSDs, stripes get replicated automatically. - -In the following diagram, client data gets striped across an object set -(``object set 1`` in the following diagram) consisting of 4 objects, where the -first stripe unit is ``stripe unit 0`` in ``object 0``, and the fourth stripe -unit is ``stripe unit 3`` in ``object 3``. After writing the fourth stripe, the -client determines if the object set is full. If the object set is not full, the -client begins writing a stripe to the first object again (``object 0`` in the -following diagram). If the object set is full, the client creates a new object -set (``object set 2`` in the following diagram), and begins writing to the first -stripe (``stripe unit 16``) in the first object in the new object set (``object -4`` in the diagram below). - -.. 
ditaa:: - +---------------+ - | Client Data | - | Format | - | cCCC | - +---------------+ - | - +-----------------+--------+--------+-----------------+ - | | | | +--\ - v v v v | - /-----------\ /-----------\ /-----------\ /-----------\ | - | Begin cCCC| | Begin cCCC| | Begin cCCC| | Begin cCCC| | - | Object 0 | | Object 1 | | Object 2 | | Object 3 | | - +-----------+ +-----------+ +-----------+ +-----------+ | - | stripe | | stripe | | stripe | | stripe | | - | unit 0 | | unit 1 | | unit 2 | | unit 3 | | - +-----------+ +-----------+ +-----------+ +-----------+ | - | stripe | | stripe | | stripe | | stripe | +-\ - | unit 4 | | unit 5 | | unit 6 | | unit 7 | | Object - +-----------+ +-----------+ +-----------+ +-----------+ +- Set - | stripe | | stripe | | stripe | | stripe | | 1 - | unit 8 | | unit 9 | | unit 10 | | unit 11 | +-/ - +-----------+ +-----------+ +-----------+ +-----------+ | - | stripe | | stripe | | stripe | | stripe | | - | unit 12 | | unit 13 | | unit 14 | | unit 15 | | - +-----------+ +-----------+ +-----------+ +-----------+ | - | End cCCC | | End cCCC | | End cCCC | | End cCCC | | - | Object 0 | | Object 1 | | Object 2 | | Object 3 | | - \-----------/ \-----------/ \-----------/ \-----------/ | - | - +--/ - - +--\ - | - /-----------\ /-----------\ /-----------\ /-----------\ | - | Begin cCCC| | Begin cCCC| | Begin cCCC| | Begin cCCC| | - | Object 4 | | Object 5 | | Object 6 | | Object 7 | | - +-----------+ +-----------+ +-----------+ +-----------+ | - | stripe | | stripe | | stripe | | stripe | | - | unit 16 | | unit 17 | | unit 18 | | unit 19 | | - +-----------+ +-----------+ +-----------+ +-----------+ | - | stripe | | stripe | | stripe | | stripe | +-\ - | unit 20 | | unit 21 | | unit 22 | | unit 23 | | Object - +-----------+ +-----------+ +-----------+ +-----------+ +- Set - | stripe | | stripe | | stripe | | stripe | | 2 - | unit 24 | | unit 25 | | unit 26 | | unit 27 | +-/ - +-----------+ +-----------+ +-----------+ +-----------+ | - | stripe | | stripe | | stripe | | stripe | | - | unit 28 | | unit 29 | | unit 30 | | unit 31 | | - +-----------+ +-----------+ +-----------+ +-----------+ | - | End cCCC | | End cCCC | | End cCCC | | End cCCC | | - | Object 4 | | Object 5 | | Object 6 | | Object 7 | | - \-----------/ \-----------/ \-----------/ \-----------/ | - | - +--/ - -Three important variables determine how Ceph stripes data: - -- **Object Size:** Objects in the Ceph Storage Cluster have a maximum - configurable size (e.g., 2MB, 4MB, etc.). The object size should be large - enough to accommodate many stripe units, and should be a multiple of - the stripe unit. - -- **Stripe Width:** Stripes have a configurable unit size (e.g., 64kb). - The Ceph Client divides the data it will write to objects into equally - sized stripe units, except for the last stripe unit. A stripe width, - should be a fraction of the Object Size so that an object may contain - many stripe units. - -- **Stripe Count:** The Ceph Client writes a sequence of stripe units - over a series of objects determined by the stripe count. The series - of objects is called an object set. After the Ceph Client writes to - the last object in the object set, it returns to the first object in - the object set. - -.. important:: Test the performance of your striping configuration before - putting your cluster into production. You CANNOT change these striping - parameters after you stripe the data and write it to objects. 
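To make the role of these three variables concrete, the sketch below converts a
byte offset into an (object number, offset-within-object) pair following the
object-set layout shown in the diagrams above. The 64 KB stripe unit, 256 KB
object size and stripe count of 4 are arbitrary example values, and the
function is only a model of the layout described here; in practice the Ceph
clients (``librbd``, CephFS, RGW) perform this mapping internally::

    def locate(offset, stripe_unit=64 * 1024, stripe_count=4, object_size=256 * 1024):
        """Return (object number, offset inside that object) for a byte offset."""
        units_per_object = object_size // stripe_unit      # stripe units held by one object
        units_per_set = units_per_object * stripe_count    # stripe units in one object set
        unit = offset // stripe_unit                       # global stripe unit number
        object_set = unit // units_per_set                 # which object set the unit falls in
        unit_in_set = unit % units_per_set
        obj = object_set * stripe_count + unit_in_set % stripe_count  # object number
        row = unit_in_set // stripe_count                  # position of the unit in the object
        return obj, row * stripe_unit + offset % stripe_unit

    # Stripe unit 5 (an offset of 5 x 64 KB) lands in the second slot of object 1:
    print(locate(5 * 64 * 1024))                           # -> (1, 65536)
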
- -Once the Ceph Client has striped data to stripe units and mapped the stripe -units to objects, Ceph's CRUSH algorithm maps the objects to placement groups, -and the placement groups to Ceph OSD Daemons before the objects are stored as -files on a storage disk. - -.. note:: Since a client writes to a single pool, all data striped into objects - get mapped to placement groups in the same pool. So they use the same CRUSH - map and the same access controls. - - -.. index:: architecture; Ceph Clients - -Ceph Clients -============ - -Ceph Clients include a number of service interfaces. These include: - -- **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service - provides resizable, thin-provisioned block devices with snapshotting and - cloning. Ceph stripes a block device across the cluster for high - performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor - that uses ``librbd`` directly--avoiding the kernel object overhead for - virtualized systems. - -- **Object Storage:** The :term:`Ceph Object Storage` (a.k.a., RGW) service - provides RESTful APIs with interfaces that are compatible with Amazon S3 - and OpenStack Swift. - -- **Filesystem**: The :term:`Ceph Filesystem` (CephFS) service provides - a POSIX compliant filesystem usable with ``mount`` or as - a filesytem in user space (FUSE). - -Ceph can run additional instances of OSDs, MDSs, and monitors for scalability -and high availability. The following diagram depicts the high-level -architecture. - -.. ditaa:: - +--------------+ +----------------+ +-------------+ - | Block Device | | Object Storage | | Ceph FS | - +--------------+ +----------------+ +-------------+ - - +--------------+ +----------------+ +-------------+ - | librbd | | librgw | | libcephfs | - +--------------+ +----------------+ +-------------+ - - +---------------------------------------------------+ - | Ceph Storage Cluster Protocol (librados) | - +---------------------------------------------------+ - - +---------------+ +---------------+ +---------------+ - | OSDs | | MDSs | | Monitors | - +---------------+ +---------------+ +---------------+ - - -.. index:: architecture; Ceph Object Storage - -Ceph Object Storage -------------------- - -The Ceph Object Storage daemon, ``radosgw``, is a FastCGI service that provides -a RESTful_ HTTP API to store objects and metadata. It layers on top of the Ceph -Storage Cluster with its own data formats, and maintains its own user database, -authentication, and access control. The RADOS Gateway uses a unified namespace, -which means you can use either the OpenStack Swift-compatible API or the Amazon -S3-compatible API. For example, you can write data using the S3-compatible API -with one application and then read data using the Swift-compatible API with -another application. - -.. topic:: S3/Swift Objects and Store Cluster Objects Compared - - Ceph's Object Storage uses the term *object* to describe the data it stores. - S3 and Swift objects are not the same as the objects that Ceph writes to the - Ceph Storage Cluster. Ceph Object Storage objects are mapped to Ceph Storage - Cluster objects. The S3 and Swift objects do not necessarily - correspond in a 1:1 manner with an object stored in the storage cluster. It - is possible for an S3 or Swift object to map to multiple Ceph objects. - -See `Ceph Object Storage`_ for details. - - -.. 
index:: Ceph Block Device; block device; RBD; Rados Block Device - -Ceph Block Device ------------------ - -A Ceph Block Device stripes a block device image over multiple objects in the -Ceph Storage Cluster, where each object gets mapped to a placement group and -distributed, and the placement groups are spread across separate ``ceph-osd`` -daemons throughout the cluster. - -.. important:: Striping allows RBD block devices to perform better than a single - server could! - -Thin-provisioned snapshottable Ceph Block Devices are an attractive option for -virtualization and cloud computing. In virtual machine scenarios, people -typically deploy a Ceph Block Device with the ``rbd`` network storage driver in -QEMU/KVM, where the host machine uses ``librbd`` to provide a block device -service to the guest. Many cloud computing stacks use ``libvirt`` to integrate -with hypervisors. You can use thin-provisioned Ceph Block Devices with QEMU and -``libvirt`` to support OpenStack and CloudStack among other solutions. - -While we do not provide ``librbd`` support with other hypervisors at this time, -you may also use Ceph Block Device kernel objects to provide a block device to a -client. Other virtualization technologies such as Xen can access the Ceph Block -Device kernel object(s). This is done with the command-line tool ``rbd``. - - -.. index:: Ceph FS; Ceph Filesystem; libcephfs; MDS; metadata server; ceph-mds - -Ceph Filesystem ---------------- - -The Ceph Filesystem (Ceph FS) provides a POSIX-compliant filesystem as a -service that is layered on top of the object-based Ceph Storage Cluster. -Ceph FS files get mapped to objects that Ceph stores in the Ceph Storage -Cluster. Ceph Clients mount a CephFS filesystem as a kernel object or as -a Filesystem in User Space (FUSE). - -.. ditaa:: - +-----------------------+ +------------------------+ - | CephFS Kernel Object | | CephFS FUSE | - +-----------------------+ +------------------------+ - - +---------------------------------------------------+ - | Ceph FS Library (libcephfs) | - +---------------------------------------------------+ - - +---------------------------------------------------+ - | Ceph Storage Cluster Protocol (librados) | - +---------------------------------------------------+ - - +---------------+ +---------------+ +---------------+ - | OSDs | | MDSs | | Monitors | - +---------------+ +---------------+ +---------------+ - - -The Ceph Filesystem service includes the Ceph Metadata Server (MDS) deployed -with the Ceph Storage cluster. The purpose of the MDS is to store all the -filesystem metadata (directories, file ownership, access modes, etc) in -high-availability Ceph Metadata Servers where the metadata resides in memory. -The reason for the MDS (a daemon called ``ceph-mds``) is that simple filesystem -operations like listing a directory or changing a directory (``ls``, ``cd``) -would tax the Ceph OSD Daemons unnecessarily. So separating the metadata from -the data means that the Ceph Filesystem can provide high performance services -without taxing the Ceph Storage Cluster. - -Ceph FS separates the metadata from the data, storing the metadata in the MDS, -and storing the file data in one or more objects in the Ceph Storage Cluster. -The Ceph filesystem aims for POSIX compatibility. ``ceph-mds`` can run as a -single process, or it can be distributed out to multiple physical machines, -either for high availability or for scalability. 
- **High Availability**: The extra ``ceph-mds`` instances can be `standby`,
  ready to take over the duties of any failed ``ceph-mds`` that was
  `active`. This is easy because all the data, including the journal, is
  stored on RADOS. The transition is triggered automatically by ``ceph-mon``.

- **Scalability**: Multiple ``ceph-mds`` instances can be `active`, and they
  will split the directory tree into subtrees (and shards of a single
  busy directory), effectively balancing the load amongst all `active`
  servers.

Combinations of `standby` and `active` daemons are possible, for example
running three `active` ``ceph-mds`` instances for scaling, and one `standby`
instance for high availability.


.. _RADOS - A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters: https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf
.. _Paxos: http://en.wikipedia.org/wiki/Paxos_(computer_science)
.. _Monitor Config Reference: ../rados/configuration/mon-config-ref
.. _Monitoring OSDs and PGs: ../rados/operations/monitoring-osd-pg
.. _Heartbeats: ../rados/configuration/mon-osd-interaction
.. _Monitoring OSDs: ../rados/operations/monitoring-osd-pg/#monitoring-osds
.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
.. _Data Scrubbing: ../rados/configuration/osd-config-ref#scrubbing
.. _Report Peering Failure: ../rados/configuration/mon-osd-interaction#osds-report-peering-failure
.. _Troubleshooting Peering Failure: ../rados/troubleshooting/troubleshooting-pg#placement-group-down-peering-failure
.. _Ceph Authentication and Authorization: ../rados/operations/auth-intro/
.. _Hardware Recommendations: ../start/hardware-recommendations
.. _Network Config Reference: ../rados/configuration/network-config-ref
.. _striping: http://en.wikipedia.org/wiki/Data_striping
.. _RAID: http://en.wikipedia.org/wiki/RAID
.. _RAID 0: http://en.wikipedia.org/wiki/RAID_0#RAID_0
.. _Ceph Object Storage: ../radosgw/
.. _RESTful: http://en.wikipedia.org/wiki/RESTful
.. _Erasure Code Notes: https://github.com/ceph/ceph/blob/40059e12af88267d0da67d8fd8d9cd81244d8f93/doc/dev/osd_internals/erasure_coding/developer_notes.rst
.. _Cache Tiering: ../rados/operations/cache-tiering
.. _Set Pool Values: ../rados/operations/pools#set-pool-values
.. _Kerberos: http://en.wikipedia.org/wiki/Kerberos_(protocol)
.. _Cephx Config Guide: ../rados/configuration/auth-config-ref
.. _User Management: ../rados/operations/user-management