==============
 Architecture
==============

:term:`Ceph` uniquely delivers **object, block, and file storage** in one
unified system. Ceph is highly reliable, easy to manage, and free. The power of
Ceph can transform your company's IT infrastructure and your ability to manage
vast amounts of data. Ceph delivers extraordinary scalability--thousands of
clients accessing petabytes to exabytes of data. A :term:`Ceph Node` leverages
commodity hardware and intelligent daemons, and a :term:`Ceph Storage Cluster`
accommodates large numbers of nodes, which communicate with each other to
replicate and redistribute data dynamically.

.. image:: images/stack.png


The Ceph Storage Cluster
========================

Ceph provides an infinitely scalable :term:`Ceph Storage Cluster` based upon
:abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, which you can read
about in `RADOS - A Scalable, Reliable Storage Service for Petabyte-scale
Storage Clusters`_.

A Ceph Storage Cluster consists of two types of daemons:

- :term:`Ceph Monitor`
- :term:`Ceph OSD Daemon`

.. ditaa::  +---------------+  +---------------+
            |      OSDs     |  |    Monitors   |
            +---------------+  +---------------+

A Ceph Monitor maintains a master copy of the cluster map. A cluster of Ceph
monitors ensures high availability should a monitor daemon fail. Storage cluster
clients retrieve a copy of the cluster map from the Ceph Monitor.

A Ceph OSD Daemon checks its own state and the state of other OSDs and reports
back to monitors.

Storage cluster clients and each :term:`Ceph OSD Daemon` use the CRUSH algorithm
to efficiently compute information about data location, instead of having to
depend on a central lookup table. Ceph's high-level features include providing a
native interface to the Ceph Storage Cluster via ``librados``, and a number of
service interfaces built on top of ``librados``.


Storing Data
------------

The Ceph Storage Cluster receives data from :term:`Ceph Clients`--whether it
comes through a :term:`Ceph Block Device`, :term:`Ceph Object Storage`, the
:term:`Ceph Filesystem` or a custom implementation you create using
``librados``--and it stores the data as objects. Each object corresponds to a
file in a filesystem, which is stored on an :term:`Object Storage Device`. Ceph
OSD Daemons handle the read/write operations on the storage disks.

.. ditaa:: /-----\       +-----+       +-----+
           | obj |------>| {d} |------>| {s} |
           \-----/       +-----+       +-----+

            Object         File         Disk

Ceph OSD Daemons store all data as objects in a flat namespace (e.g., no
hierarchy of directories). An object has an identifier, binary data, and
metadata consisting of a set of name/value pairs. The semantics are completely
up to :term:`Ceph Clients`. For example, CephFS uses metadata to store file
attributes such as the file owner, created date, last modified date, and so
forth.


.. 
ditaa:: /------+------------------------------+----------------\ - | ID | Binary Data | Metadata | - +------+------------------------------+----------------+ - | 1234 | 0101010101010100110101010010 | name1 = value1 | - | | 0101100001010100110101010010 | name2 = value2 | - | | 0101100001010100110101010010 | nameN = valueN | - \------+------------------------------+----------------/ - -.. note:: An object ID is unique across the entire cluster, not just the local - filesystem. - - -.. index:: architecture; high availability, scalability - -Scalability and High Availability ---------------------------------- - -In traditional architectures, clients talk to a centralized component (e.g., a -gateway, broker, API, facade, etc.), which acts as a single point of entry to a -complex subsystem. This imposes a limit to both performance and scalability, -while introducing a single point of failure (i.e., if the centralized component -goes down, the whole system goes down, too). - -Ceph eliminates the centralized gateway to enable clients to interact with -Ceph OSD Daemons directly. Ceph OSD Daemons create object replicas on other -Ceph Nodes to ensure data safety and high availability. Ceph also uses a cluster -of monitors to ensure high availability. To eliminate centralization, Ceph -uses an algorithm called CRUSH. - - -.. index:: CRUSH; architecture - -CRUSH Introduction -~~~~~~~~~~~~~~~~~~ - -Ceph Clients and Ceph OSD Daemons both use the :abbr:`CRUSH (Controlled -Replication Under Scalable Hashing)` algorithm to efficiently compute -information about object location, instead of having to depend on a -central lookup table. CRUSH provides a better data management mechanism compared -to older approaches, and enables massive scale by cleanly distributing the work -to all the clients and OSD daemons in the cluster. CRUSH uses intelligent data -replication to ensure resiliency, which is better suited to hyper-scale storage. -The following sections provide additional details on how CRUSH works. For a -detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable, Decentralized -Placement of Replicated Data`_. - -.. index:: architecture; cluster map - -Cluster Map -~~~~~~~~~~~ - -Ceph depends upon Ceph Clients and Ceph OSD Daemons having knowledge of the -cluster topology, which is inclusive of 5 maps collectively referred to as the -"Cluster Map": - -#. **The Monitor Map:** Contains the cluster ``fsid``, the position, name - address and port of each monitor. It also indicates the current epoch, - when the map was created, and the last time it changed. To view a monitor - map, execute ``ceph mon dump``. - -#. **The OSD Map:** Contains the cluster ``fsid``, when the map was created and - last modified, a list of pools, replica sizes, PG numbers, a list of OSDs - and their status (e.g., ``up``, ``in``). To view an OSD map, execute - ``ceph osd dump``. - -#. **The PG Map:** Contains the PG version, its time stamp, the last OSD - map epoch, the full ratios, and details on each placement group such as - the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (e.g., - ``active + clean``), and data usage statistics for each pool. - -#. **The CRUSH Map:** Contains a list of storage devices, the failure domain - hierarchy (e.g., device, host, rack, row, room, etc.), and rules for - traversing the hierarchy when storing data. To view a CRUSH map, execute - ``ceph osd getcrushmap -o {filename}``; then, decompile it by executing - ``crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}``. 
- You can view the decompiled map in a text editor or with ``cat``. - -#. **The MDS Map:** Contains the current MDS map epoch, when the map was - created, and the last time it changed. It also contains the pool for - storing metadata, a list of metadata servers, and which metadata servers - are ``up`` and ``in``. To view an MDS map, execute ``ceph fs dump``. - -Each map maintains an iterative history of its operating state changes. Ceph -Monitors maintain a master copy of the cluster map including the cluster -members, state, changes, and the overall health of the Ceph Storage Cluster. - -.. index:: high availability; monitor architecture - -High Availability Monitors -~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Before Ceph Clients can read or write data, they must contact a Ceph Monitor -to obtain the most recent copy of the cluster map. A Ceph Storage Cluster -can operate with a single monitor; however, this introduces a single -point of failure (i.e., if the monitor goes down, Ceph Clients cannot -read or write data). - -For added reliability and fault tolerance, Ceph supports a cluster of monitors. -In a cluster of monitors, latency and other faults can cause one or more -monitors to fall behind the current state of the cluster. For this reason, Ceph -must have agreement among various monitor instances regarding the state of the -cluster. Ceph always uses a majority of monitors (e.g., 1, 2:3, 3:5, 4:6, etc.) -and the `Paxos`_ algorithm to establish a consensus among the monitors about the -current state of the cluster. - -For details on configuring monitors, see the `Monitor Config Reference`_. - -.. index:: architecture; high availability authentication - -High Availability Authentication -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -To identify users and protect against man-in-the-middle attacks, Ceph provides -its ``cephx`` authentication system to authenticate users and daemons. - -.. note:: The ``cephx`` protocol does not address data encryption in transport - (e.g., SSL/TLS) or encryption at rest. - -Cephx uses shared secret keys for authentication, meaning both the client and -the monitor cluster have a copy of the client's secret key. The authentication -protocol is such that both parties are able to prove to each other they have a -copy of the key without actually revealing it. This provides mutual -authentication, which means the cluster is sure the user possesses the secret -key, and the user is sure that the cluster has a copy of the secret key. - -A key scalability feature of Ceph is to avoid a centralized interface to the -Ceph object store, which means that Ceph clients must be able to interact with -OSDs directly. To protect data, Ceph provides its ``cephx`` authentication -system, which authenticates users operating Ceph clients. The ``cephx`` protocol -operates in a manner with behavior similar to `Kerberos`_. - -A user/actor invokes a Ceph client to contact a monitor. Unlike Kerberos, each -monitor can authenticate users and distribute keys, so there is no single point -of failure or bottleneck when using ``cephx``. The monitor returns an -authentication data structure similar to a Kerberos ticket that contains a -session key for use in obtaining Ceph services. This session key is itself -encrypted with the user's permanent secret key, so that only the user can -request services from the Ceph Monitor(s). 
The client then uses the session key -to request its desired services from the monitor, and the monitor provides the -client with a ticket that will authenticate the client to the OSDs that actually -handle data. Ceph Monitors and OSDs share a secret, so the client can use the -ticket provided by the monitor with any OSD or metadata server in the cluster. -Like Kerberos, ``cephx`` tickets expire, so an attacker cannot use an expired -ticket or session key obtained surreptitiously. This form of authentication will -prevent attackers with access to the communications medium from either creating -bogus messages under another user's identity or altering another user's -legitimate messages, as long as the user's secret key is not divulged before it -expires. - -To use ``cephx``, an administrator must set up users first. In the following -diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from -the command line to generate a username and secret key. Ceph's ``auth`` -subsystem generates the username and key, stores a copy with the monitor(s) and -transmits the user's secret back to the ``client.admin`` user. This means that -the client and the monitor share a secret key. - -.. note:: The ``client.admin`` user must provide the user ID and - secret key to the user in a secure manner. - -.. ditaa:: +---------+ +---------+ - | Client | | Monitor | - +---------+ +---------+ - | request to | - | create a user | - |-------------->|----------+ create user - | | | and - |<--------------|<---------+ store key - | transmit key | - | | - - -To authenticate with the monitor, the client passes in the user name to the -monitor, and the monitor generates a session key and encrypts it with the secret -key associated to the user name. Then, the monitor transmits the encrypted -ticket back to the client. The client then decrypts the payload with the shared -secret key to retrieve the session key. The session key identifies the user for -the current session. The client then requests a ticket on behalf of the user -signed by the session key. The monitor generates a ticket, encrypts it with the -user's secret key and transmits it back to the client. The client decrypts the -ticket and uses it to sign requests to OSDs and metadata servers throughout the -cluster. - -.. ditaa:: +---------+ +---------+ - | Client | | Monitor | - +---------+ +---------+ - | authenticate | - |-------------->|----------+ generate and - | | | encrypt - |<--------------|<---------+ session key - | transmit | - | encrypted | - | session key | - | | - |-----+ decrypt | - | | session | - |<----+ key | - | | - | req. ticket | - |-------------->|----------+ generate and - | | | encrypt - |<--------------|<---------+ ticket - | recv. ticket | - | | - |-----+ decrypt | - | | ticket | - |<----+ | - - -The ``cephx`` protocol authenticates ongoing communications between the client -machine and the Ceph servers. Each message sent between a client and server, -subsequent to the initial authentication, is signed using a ticket that the -monitors, OSDs and metadata servers can verify with their shared secret. - -.. ditaa:: +---------+ +---------+ +-------+ +-------+ - | Client | | Monitor | | MDS | | OSD | - +---------+ +---------+ +-------+ +-------+ - | request to | | | - | create a user | | | - |-------------->| mon and | | - |<--------------| client share | | - | receive | a secret. 
| | - | shared secret | | | - | |<------------>| | - | |<-------------+------------>| - | | mon, mds, | | - | authenticate | and osd | | - |-------------->| share | | - |<--------------| a secret | | - | session key | | | - | | | | - | req. ticket | | | - |-------------->| | | - |<--------------| | | - | recv. ticket | | | - | | | | - | make request (CephFS only) | | - |----------------------------->| | - |<-----------------------------| | - | receive response (CephFS only) | - | | - | make request | - |------------------------------------------->| - |<-------------------------------------------| - receive response - -The protection offered by this authentication is between the Ceph client and the -Ceph server hosts. The authentication is not extended beyond the Ceph client. If -the user accesses the Ceph client from a remote host, Ceph authentication is not -applied to the connection between the user's host and the client host. - - -For configuration details, see `Cephx Config Guide`_. For user management -details, see `User Management`_. - - -.. index:: architecture; smart daemons and scalability - -Smart Daemons Enable Hyperscale -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -In many clustered architectures, the primary purpose of cluster membership is -so that a centralized interface knows which nodes it can access. Then the -centralized interface provides services to the client through a double -dispatch--which is a **huge** bottleneck at the petabyte-to-exabyte scale. - -Ceph eliminates the bottleneck: Ceph's OSD Daemons AND Ceph Clients are cluster -aware. Like Ceph clients, each Ceph OSD Daemon knows about other Ceph OSD -Daemons in the cluster. This enables Ceph OSD Daemons to interact directly with -other Ceph OSD Daemons and Ceph Monitors. Additionally, it enables Ceph Clients -to interact directly with Ceph OSD Daemons. - -The ability of Ceph Clients, Ceph Monitors and Ceph OSD Daemons to interact with -each other means that Ceph OSD Daemons can utilize the CPU and RAM of the Ceph -nodes to easily perform tasks that would bog down a centralized server. The -ability to leverage this computing power leads to several major benefits: - -#. **OSDs Service Clients Directly:** Since any network device has a limit to - the number of concurrent connections it can support, a centralized system - has a low physical limit at high scales. By enabling Ceph Clients to contact - Ceph OSD Daemons directly, Ceph increases both performance and total system - capacity simultaneously, while removing a single point of failure. Ceph - Clients can maintain a session when they need to, and with a particular Ceph - OSD Daemon instead of a centralized server. - -#. **OSD Membership and Status**: Ceph OSD Daemons join a cluster and report - on their status. At the lowest level, the Ceph OSD Daemon status is ``up`` - or ``down`` reflecting whether or not it is running and able to service - Ceph Client requests. If a Ceph OSD Daemon is ``down`` and ``in`` the Ceph - Storage Cluster, this status may indicate the failure of the Ceph OSD - Daemon. If a Ceph OSD Daemon is not running (e.g., it crashes), the Ceph OSD - Daemon cannot notify the Ceph Monitor that it is ``down``. The OSDs - periodically send messages to the Ceph Monitor (``MPGStats`` pre-luminous, - and a new ``MOSDBeacon`` in luminous). If the Ceph Monitor doesn't see that - message after a configurable period of time then it marks the OSD down. - This mechanism is a failsafe, however. 
Normally, Ceph OSD Daemons will - determine if a neighboring OSD is down and report it to the Ceph Monitor(s). - This assures that Ceph Monitors are lightweight processes. See `Monitoring - OSDs`_ and `Heartbeats`_ for additional details. - -#. **Data Scrubbing:** As part of maintaining data consistency and cleanliness, - Ceph OSD Daemons can scrub objects within placement groups. That is, Ceph - OSD Daemons can compare object metadata in one placement group with its - replicas in placement groups stored on other OSDs. Scrubbing (usually - performed daily) catches bugs or filesystem errors. Ceph OSD Daemons also - perform deeper scrubbing by comparing data in objects bit-for-bit. Deep - scrubbing (usually performed weekly) finds bad sectors on a drive that - weren't apparent in a light scrub. See `Data Scrubbing`_ for details on - configuring scrubbing. - -#. **Replication:** Like Ceph Clients, Ceph OSD Daemons use the CRUSH - algorithm, but the Ceph OSD Daemon uses it to compute where replicas of - objects should be stored (and for rebalancing). In a typical write scenario, - a client uses the CRUSH algorithm to compute where to store an object, maps - the object to a pool and placement group, then looks at the CRUSH map to - identify the primary OSD for the placement group. - - The client writes the object to the identified placement group in the - primary OSD. Then, the primary OSD with its own copy of the CRUSH map - identifies the secondary and tertiary OSDs for replication purposes, and - replicates the object to the appropriate placement groups in the secondary - and tertiary OSDs (as many OSDs as additional replicas), and responds to the - client once it has confirmed the object was stored successfully. - -.. ditaa:: - +----------+ - | Client | - | | - +----------+ - * ^ - Write (1) | | Ack (6) - | | - v * - +-------------+ - | Primary OSD | - | | - +-------------+ - * ^ ^ * - Write (2) | | | | Write (3) - +------+ | | +------+ - | +------+ +------+ | - | | Ack (4) Ack (5)| | - v * * v - +---------------+ +---------------+ - | Secondary OSD | | Tertiary OSD | - | | | | - +---------------+ +---------------+ - -With the ability to perform data replication, Ceph OSD Daemons relieve Ceph -clients from that duty, while ensuring high data availability and data safety. - - -Dynamic Cluster Management --------------------------- - -In the `Scalability and High Availability`_ section, we explained how Ceph uses -CRUSH, cluster awareness and intelligent daemons to scale and maintain high -availability. Key to Ceph's design is the autonomous, self-healing, and -intelligent Ceph OSD Daemon. Let's take a deeper look at how CRUSH works to -enable modern cloud storage infrastructures to place data, rebalance the cluster -and recover from faults dynamically. - -.. index:: architecture; pools - -About Pools -~~~~~~~~~~~ - -The Ceph storage system supports the notion of 'Pools', which are logical -partitions for storing objects. - -Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write objects to -pools. The pool's ``size`` or number of replicas, the CRUSH ruleset and the -number of placement groups determine how Ceph will place the data. - -.. 
ditaa:: - +--------+ Retrieves +---------------+ - | Client |------------>| Cluster Map | - +--------+ +---------------+ - | - v Writes - /-----\ - | obj | - \-----/ - | To - v - +--------+ +---------------+ - | Pool |---------->| CRUSH Ruleset | - +--------+ Selects +---------------+ - - -Pools set at least the following parameters: - -- Ownership/Access to Objects -- The Number of Placement Groups, and -- The CRUSH Ruleset to Use. - -See `Set Pool Values`_ for details. - - -.. index: architecture; placement group mapping - -Mapping PGs to OSDs -~~~~~~~~~~~~~~~~~~~ - -Each pool has a number of placement groups. CRUSH maps PGs to OSDs dynamically. -When a Ceph Client stores objects, CRUSH will map each object to a placement -group. - -Mapping objects to placement groups creates a layer of indirection between the -Ceph OSD Daemon and the Ceph Client. The Ceph Storage Cluster must be able to -grow (or shrink) and rebalance where it stores objects dynamically. If the Ceph -Client "knew" which Ceph OSD Daemon had which object, that would create a tight -coupling between the Ceph Client and the Ceph OSD Daemon. Instead, the CRUSH -algorithm maps each object to a placement group and then maps each placement -group to one or more Ceph OSD Daemons. This layer of indirection allows Ceph to -rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices -come online. The following diagram depicts how CRUSH maps objects to placement -groups, and placement groups to OSDs. - -.. ditaa:: - /-----\ /-----\ /-----\ /-----\ /-----\ - | obj | | obj | | obj | | obj | | obj | - \-----/ \-----/ \-----/ \-----/ \-----/ - | | | | | - +--------+--------+ +---+----+ - | | - v v - +-----------------------+ +-----------------------+ - | Placement Group #1 | | Placement Group #2 | - | | | | - +-----------------------+ +-----------------------+ - | | - | +-----------------------+---+ - +------+------+-------------+ | - | | | | - v v v v - /----------\ /----------\ /----------\ /----------\ - | | | | | | | | - | OSD #1 | | OSD #2 | | OSD #3 | | OSD #4 | - | | | | | | | | - \----------/ \----------/ \----------/ \----------/ - -With a copy of the cluster map and the CRUSH algorithm, the client can compute -exactly which OSD to use when reading or writing a particular object. - -.. index:: architecture; calculating PG IDs - -Calculating PG IDs -~~~~~~~~~~~~~~~~~~ - -When a Ceph Client binds to a Ceph Monitor, it retrieves the latest copy of the -`Cluster Map`_. With the cluster map, the client knows about all of the monitors, -OSDs, and metadata servers in the cluster. **However, it doesn't know anything -about object locations.** - -.. epigraph:: - - Object locations get computed. - - -The only input required by the client is the object ID and the pool. -It's simple: Ceph stores data in named pools (e.g., "liverpool"). When a client -wants to store a named object (e.g., "john," "paul," "george," "ringo", etc.) -it calculates a placement group using the object name, a hash code, the -number of PGs in the pool and the pool name. Ceph clients use the following -steps to compute PG IDs. - -#. The client inputs the pool ID and the object ID. (e.g., pool = "liverpool" - and object-id = "john") -#. Ceph takes the object ID and hashes it. -#. Ceph calculates the hash modulo the number of PGs. (e.g., ``58``) to get - a PG ID. -#. Ceph gets the pool ID given the pool name (e.g., "liverpool" = ``4``) -#. Ceph prepends the pool ID to the PG ID (e.g., ``4.58``). 
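The following sketch mirrors those steps. It is only an illustration: the hash
function (CRC32 here), the pool ID and the PG count are stand-in assumptions,
whereas Ceph itself hashes object names with its own hash function and uses a
stable-mod placement rather than a plain modulo::

    import zlib

    def compute_pg_id(pool_id, pg_num, object_name):
        """Map an object name to a '<pool-id>.<pg-id>' string (simplified)."""
        obj_hash = zlib.crc32(object_name.encode())  # stand-in for Ceph's object-name hash
        pg = obj_hash % pg_num                       # hash modulo the number of PGs
        return "{0}.{1:x}".format(pool_id, pg)       # prepend the pool ID to the PG ID

    # e.g., pool "liverpool" with pool ID 4 and 128 PGs, object "john"
    print(compute_pg_id(4, 128, "john"))             # prints a PG ID of the form '4.xx'

On a running cluster, ``ceph osd map {pool-name} {object-name}`` reports the
placement group (and acting set) that Ceph actually computes for an object.
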
Computing object locations is much faster than performing an object location
query over a chatty session. The :abbr:`CRUSH (Controlled Replication Under
Scalable Hashing)` algorithm allows a client to compute where objects *should*
be stored, and enables the client to contact the primary OSD to store or
retrieve the objects.

.. index:: architecture; PG Peering

Peering and Sets
~~~~~~~~~~~~~~~~

In previous sections, we noted that Ceph OSD Daemons check each other's
heartbeats and report back to the Ceph Monitor. Ceph OSD Daemons also perform
'peering', which is the process of bringing all of the OSDs that store a
Placement Group (PG) into agreement about the state of all of the objects (and
their metadata) in that PG. In fact, Ceph OSD Daemons `Report Peering Failure`_
to the Ceph Monitors. Peering issues usually resolve themselves; however, if
the problem persists, you may need to refer to the `Troubleshooting Peering
Failure`_ section.

.. Note:: Agreeing on the state does not mean that the PGs have the latest contents.

The Ceph Storage Cluster was designed to store at least two copies of an object
(i.e., ``size = 2``), which is the minimum requirement for data safety. For high
availability, a Ceph Storage Cluster should store more than two copies of an
object (e.g., ``size = 3`` and ``min size = 2``) so that it can continue to run
in a ``degraded`` state while maintaining data safety.

Referring back to the diagram in `Smart Daemons Enable Hyperscale`_, we do not
name the Ceph OSD Daemons specifically (e.g., ``osd.0``, ``osd.1``, etc.), but
rather refer to them as *Primary*, *Secondary*, and so forth. By convention,
the *Primary* is the first OSD in the *Acting Set*. It is responsible for
coordinating the peering process for each placement group where it acts as the
*Primary*, and it is the **ONLY** OSD that will accept client-initiated writes
to objects for a given placement group where it acts as the *Primary*.

We refer to the series of OSDs responsible for a placement group as an *Acting
Set*. An *Acting Set* may refer to the Ceph OSD Daemons that are currently
responsible for the placement group, or the Ceph OSD Daemons that were
responsible for a particular placement group as of some epoch.

The Ceph OSD daemons that are part of an *Acting Set* may not always be ``up``.
When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*. The *Up
Set* is an important distinction, because Ceph can remap PGs to other Ceph OSD
Daemons when an OSD fails.

.. note:: In an *Acting Set* for a PG containing ``osd.25``, ``osd.32`` and
   ``osd.61``, the first OSD, ``osd.25``, is the *Primary*. If that OSD fails,
   the Secondary, ``osd.32``, becomes the *Primary*, and ``osd.25`` will be
   removed from the *Up Set*.


.. index:: architecture; Rebalancing

Rebalancing
~~~~~~~~~~~

When you add a Ceph OSD Daemon to a Ceph Storage Cluster, the cluster map gets
updated with the new OSD. Referring back to `Calculating PG IDs`_, this changes
the cluster map. Consequently, it changes object placement, because it changes
an input for the calculations. The following diagram depicts the rebalancing
process (albeit rather crudely, since it is substantially less impactful with
large clusters) where some, but not all, of the PGs migrate from existing OSDs
(OSD 1 and OSD 2) to the new OSD (OSD 3). Even when rebalancing, CRUSH is
stable. 
Many of the placement groups remain in their original configuration, -and each OSD gets some added capacity, so there are no load spikes on the -new OSD after rebalancing is complete. - - -.. ditaa:: - +--------+ +--------+ - Before | OSD 1 | | OSD 2 | - +--------+ +--------+ - | PG #1 | | PG #6 | - | PG #2 | | PG #7 | - | PG #3 | | PG #8 | - | PG #4 | | PG #9 | - | PG #5 | | PG #10 | - +--------+ +--------+ - - +--------+ +--------+ +--------+ - After | OSD 1 | | OSD 2 | | OSD 3 | - +--------+ +--------+ +--------+ - | PG #1 | | PG #7 | | PG #3 | - | PG #2 | | PG #8 | | PG #6 | - | PG #4 | | PG #10 | | PG #9 | - | PG #5 | | | | | - | | | | | | - +--------+ +--------+ +--------+ - - -.. index:: architecture; Data Scrubbing - -Data Consistency -~~~~~~~~~~~~~~~~ - -As part of maintaining data consistency and cleanliness, Ceph OSDs can also -scrub objects within placement groups. That is, Ceph OSDs can compare object -metadata in one placement group with its replicas in placement groups stored in -other OSDs. Scrubbing (usually performed daily) catches OSD bugs or filesystem -errors. OSDs can also perform deeper scrubbing by comparing data in objects -bit-for-bit. Deep scrubbing (usually performed weekly) finds bad sectors on a -disk that weren't apparent in a light scrub. - -See `Data Scrubbing`_ for details on configuring scrubbing. - - - - - -.. index:: erasure coding - -Erasure Coding --------------- - -An erasure coded pool stores each object as ``K+M`` chunks. It is divided into -``K`` data chunks and ``M`` coding chunks. The pool is configured to have a size -of ``K+M`` so that each chunk is stored in an OSD in the acting set. The rank of -the chunk is stored as an attribute of the object. - -For instance an erasure coded pool is created to use five OSDs (``K+M = 5``) and -sustain the loss of two of them (``M = 2``). - -Reading and Writing Encoded Chunks -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -When the object **NYAN** containing ``ABCDEFGHI`` is written to the pool, the erasure -encoding function splits the content into three data chunks simply by dividing -the content in three: the first contains ``ABC``, the second ``DEF`` and the -last ``GHI``. The content will be padded if the content length is not a multiple -of ``K``. The function also creates two coding chunks: the fourth with ``YXY`` -and the fifth with ``GQC``. Each chunk is stored in an OSD in the acting set. -The chunks are stored in objects that have the same name (**NYAN**) but reside -on different OSDs. The order in which the chunks were created must be preserved -and is stored as an attribute of the object (``shard_t``), in addition to its -name. Chunk 1 contains ``ABC`` and is stored on **OSD5** while chunk 4 contains -``YXY`` and is stored on **OSD3**. - - -.. 
ditaa:: - +-------------------+ - name | NYAN | - +-------------------+ - content | ABCDEFGHI | - +--------+----------+ - | - | - v - +------+------+ - +---------------+ encode(3,2) +-----------+ - | +--+--+---+---+ | - | | | | | - | +-------+ | +-----+ | - | | | | | - +--v---+ +--v---+ +--v---+ +--v---+ +--v---+ - name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN | - +------+ +------+ +------+ +------+ +------+ - shard | 1 | | 2 | | 3 | | 4 | | 5 | - +------+ +------+ +------+ +------+ +------+ - content | ABC | | DEF | | GHI | | YXY | | QGC | - +--+---+ +--+---+ +--+---+ +--+---+ +--+---+ - | | | | | - | | v | | - | | +--+---+ | | - | | | OSD1 | | | - | | +------+ | | - | | | | - | | +------+ | | - | +------>| OSD2 | | | - | +------+ | | - | | | - | +------+ | | - | | OSD3 |<----+ | - | +------+ | - | | - | +------+ | - | | OSD4 |<--------------+ - | +------+ - | - | +------+ - +----------------->| OSD5 | - +------+ - - -When the object **NYAN** is read from the erasure coded pool, the decoding -function reads three chunks: chunk 1 containing ``ABC``, chunk 3 containing -``GHI`` and chunk 4 containing ``YXY``. Then, it rebuilds the original content -of the object ``ABCDEFGHI``. The decoding function is informed that the chunks 2 -and 5 are missing (they are called 'erasures'). The chunk 5 could not be read -because the **OSD4** is out. The decoding function can be called as soon as -three chunks are read: **OSD2** was the slowest and its chunk was not taken into -account. - -.. ditaa:: - +-------------------+ - name | NYAN | - +-------------------+ - content | ABCDEFGHI | - +---------+---------+ - ^ - | - | - +-------+-------+ - | decode(3,2) | - +------------->+ erasures 2,5 +<-+ - | | | | - | +-------+-------+ | - | ^ | - | | | - | | | - +--+---+ +------+ +---+--+ +---+--+ - name | NYAN | | NYAN | | NYAN | | NYAN | - +------+ +------+ +------+ +------+ - shard | 1 | | 2 | | 3 | | 4 | - +------+ +------+ +------+ +------+ - content | ABC | | DEF | | GHI | | YXY | - +--+---+ +--+---+ +--+---+ +--+---+ - ^ . ^ ^ - | TOO . | | - | SLOW . +--+---+ | - | ^ | OSD1 | | - | | +------+ | - | | | - | | +------+ | - | +-------| OSD2 | | - | +------+ | - | | - | +------+ | - | | OSD3 |------+ - | +------+ - | - | +------+ - | | OSD4 | OUT - | +------+ - | - | +------+ - +------------------| OSD5 | - +------+ - - -Interrupted Full Writes -~~~~~~~~~~~~~~~~~~~~~~~ - -In an erasure coded pool, the primary OSD in the up set receives all write -operations. It is responsible for encoding the payload into ``K+M`` chunks and -sends them to the other OSDs. It is also responsible for maintaining an -authoritative version of the placement group logs. - -In the following diagram, an erasure coded placement group has been created with -``K = 2 + M = 1`` and is supported by three OSDs, two for ``K`` and one for -``M``. The acting set of the placement group is made of **OSD 1**, **OSD 2** and -**OSD 3**. An object has been encoded and stored in the OSDs : the chunk -``D1v1`` (i.e. Data chunk number 1, version 1) is on **OSD 1**, ``D2v1`` on -**OSD 2** and ``C1v1`` (i.e. Coding chunk number 1, version 1) on **OSD 3**. The -placement group logs on each OSD are identical (i.e. ``1,1`` for epoch 1, -version 1). - - -.. 
ditaa:: - Primary OSD - - +-------------+ - | OSD 1 | +-------------+ - | log | Write Full | | - | +----+ |<------------+ Ceph Client | - | |D1v1| 1,1 | v1 | | - | +----+ | +-------------+ - +------+------+ - | - | - | +-------------+ - | | OSD 2 | - | | log | - +--------->+ +----+ | - | | |D2v1| 1,1 | - | | +----+ | - | +-------------+ - | - | +-------------+ - | | OSD 3 | - | | log | - +--------->| +----+ | - | |C1v1| 1,1 | - | +----+ | - +-------------+ - -**OSD 1** is the primary and receives a **WRITE FULL** from a client, which -means the payload is to replace the object entirely instead of overwriting a -portion of it. Version 2 (v2) of the object is created to override version 1 -(v1). **OSD 1** encodes the payload into three chunks: ``D1v2`` (i.e. Data -chunk number 1 version 2) will be on **OSD 1**, ``D2v2`` on **OSD 2** and -``C1v2`` (i.e. Coding chunk number 1 version 2) on **OSD 3**. Each chunk is sent -to the target OSD, including the primary OSD which is responsible for storing -chunks in addition to handling write operations and maintaining an authoritative -version of the placement group logs. When an OSD receives the message -instructing it to write the chunk, it also creates a new entry in the placement -group logs to reflect the change. For instance, as soon as **OSD 3** stores -``C1v2``, it adds the entry ``1,2`` ( i.e. epoch 1, version 2 ) to its logs. -Because the OSDs work asynchronously, some chunks may still be in flight ( such -as ``D2v2`` ) while others are acknowledged and on disk ( such as ``C1v1`` and -``D1v1``). - -.. ditaa:: - - Primary OSD - - +-------------+ - | OSD 1 | - | log | - | +----+ | +-------------+ - | |D1v2| 1,2 | Write Full | | - | +----+ +<------------+ Ceph Client | - | | v2 | | - | +----+ | +-------------+ - | |D1v1| 1,1 | - | +----+ | - +------+------+ - | - | - | +------+------+ - | | OSD 2 | - | +------+ | log | - +->| D2v2 | | +----+ | - | +------+ | |D2v1| 1,1 | - | | +----+ | - | +-------------+ - | - | +-------------+ - | | OSD 3 | - | | log | - | | +----+ | - | | |C1v2| 1,2 | - +---------->+ +----+ | - | | - | +----+ | - | |C1v1| 1,1 | - | +----+ | - +-------------+ - - -If all goes well, the chunks are acknowledged on each OSD in the acting set and -the logs' ``last_complete`` pointer can move from ``1,1`` to ``1,2``. - -.. ditaa:: - - Primary OSD - - +-------------+ - | OSD 1 | - | log | - | +----+ | +-------------+ - | |D1v2| 1,2 | Write Full | | - | +----+ +<------------+ Ceph Client | - | | v2 | | - | +----+ | +-------------+ - | |D1v1| 1,1 | - | +----+ | - +------+------+ - | - | +-------------+ - | | OSD 2 | - | | log | - | | +----+ | - | | |D2v2| 1,2 | - +---------->+ +----+ | - | | | - | | +----+ | - | | |D2v1| 1,1 | - | | +----+ | - | +-------------+ - | - | +-------------+ - | | OSD 3 | - | | log | - | | +----+ | - | | |C1v2| 1,2 | - +---------->+ +----+ | - | | - | +----+ | - | |C1v1| 1,1 | - | +----+ | - +-------------+ - - -Finally, the files used to store the chunks of the previous version of the -object can be removed: ``D1v1`` on **OSD 1**, ``D2v1`` on **OSD 2** and ``C1v1`` -on **OSD 3**. - -.. ditaa:: - Primary OSD - - +-------------+ - | OSD 1 | - | log | - | +----+ | - | |D1v2| 1,2 | - | +----+ | - +------+------+ - | - | - | +-------------+ - | | OSD 2 | - | | log | - +--------->+ +----+ | - | | |D2v2| 1,2 | - | | +----+ | - | +-------------+ - | - | +-------------+ - | | OSD 3 | - | | log | - +--------->| +----+ | - | |C1v2| 1,2 | - | +----+ | - +-------------+ - - -But accidents happen. 
If **OSD 1** goes down while ``D2v2`` is still in flight, -the object's version 2 is partially written: **OSD 3** has one chunk but that is -not enough to recover. It lost two chunks: ``D1v2`` and ``D2v2`` and the -erasure coding parameters ``K = 2``, ``M = 1`` require that at least two chunks are -available to rebuild the third. **OSD 4** becomes the new primary and finds that -the ``last_complete`` log entry (i.e., all objects before this entry were known -to be available on all OSDs in the previous acting set ) is ``1,1`` and that -will be the head of the new authoritative log. - -.. ditaa:: - +-------------+ - | OSD 1 | - | (down) | - | c333 | - +------+------+ - | - | +-------------+ - | | OSD 2 | - | | log | - | | +----+ | - +---------->+ |D2v1| 1,1 | - | | +----+ | - | | | - | +-------------+ - | - | +-------------+ - | | OSD 3 | - | | log | - | | +----+ | - | | |C1v2| 1,2 | - +---------->+ +----+ | - | | - | +----+ | - | |C1v1| 1,1 | - | +----+ | - +-------------+ - Primary OSD - +-------------+ - | OSD 4 | - | log | - | | - | 1,1 | - | | - +------+------+ - - - -The log entry 1,2 found on **OSD 3** is divergent from the new authoritative log -provided by **OSD 4**: it is discarded and the file containing the ``C1v2`` -chunk is removed. The ``D1v1`` chunk is rebuilt with the ``decode`` function of -the erasure coding library during scrubbing and stored on the new primary -**OSD 4**. - - -.. ditaa:: - Primary OSD - - +-------------+ - | OSD 4 | - | log | - | +----+ | - | |D1v1| 1,1 | - | +----+ | - +------+------+ - ^ - | - | +-------------+ - | | OSD 2 | - | | log | - +----------+ +----+ | - | | |D2v1| 1,1 | - | | +----+ | - | +-------------+ - | - | +-------------+ - | | OSD 3 | - | | log | - +----------| +----+ | - | |C1v1| 1,1 | - | +----+ | - +-------------+ - - +-------------+ - | OSD 1 | - | (down) | - | c333 | - +-------------+ - -See `Erasure Code Notes`_ for additional details. - - - -Cache Tiering -------------- - -A cache tier provides Ceph Clients with better I/O performance for a subset of -the data stored in a backing storage tier. Cache tiering involves creating a -pool of relatively fast/expensive storage devices (e.g., solid state drives) -configured to act as a cache tier, and a backing pool of either erasure-coded -or relatively slower/cheaper devices configured to act as an economical storage -tier. The Ceph objecter handles where to place the objects and the tiering -agent determines when to flush objects from the cache to the backing storage -tier. So the cache tier and the backing storage tier are completely transparent -to Ceph clients. - - -.. ditaa:: - +-------------+ - | Ceph Client | - +------+------+ - ^ - Tiering is | - Transparent | Faster I/O - to Ceph | +---------------+ - Client Ops | | | - | +----->+ Cache Tier | - | | | | - | | +-----+---+-----+ - | | | ^ - v v | | Active Data in Cache Tier - +------+----+--+ | | - | Objecter | | | - +-----------+--+ | | - ^ | | Inactive Data in Storage Tier - | v | - | +-----+---+-----+ - | | | - +----->| Storage Tier | - | | - +---------------+ - Slower I/O - -See `Cache Tiering`_ for additional details. - - -.. index:: Extensibility, Ceph Classes - -Extending Ceph --------------- - -You can extend Ceph by creating shared object classes called 'Ceph Classes'. -Ceph loads ``.so`` classes stored in the ``osd class dir`` directory dynamically -(i.e., ``$libdir/rados-classes`` by default). 
When you implement a class, you -can create new object methods that have the ability to call the native methods -in the Ceph Object Store, or other class methods you incorporate via libraries -or create yourself. - -On writes, Ceph Classes can call native or class methods, perform any series of -operations on the inbound data and generate a resulting write transaction that -Ceph will apply atomically. - -On reads, Ceph Classes can call native or class methods, perform any series of -operations on the outbound data and return the data to the client. - -.. topic:: Ceph Class Example - - A Ceph class for a content management system that presents pictures of a - particular size and aspect ratio could take an inbound bitmap image, crop it - to a particular aspect ratio, resize it and embed an invisible copyright or - watermark to help protect the intellectual property; then, save the - resulting bitmap image to the object store. - -See ``src/objclass/objclass.h``, ``src/fooclass.cc`` and ``src/barclass`` for -exemplary implementations. - - -Summary -------- - -Ceph Storage Clusters are dynamic--like a living organism. Whereas, many storage -appliances do not fully utilize the CPU and RAM of a typical commodity server, -Ceph does. From heartbeats, to peering, to rebalancing the cluster or -recovering from faults, Ceph offloads work from clients (and from a centralized -gateway which doesn't exist in the Ceph architecture) and uses the computing -power of the OSDs to perform the work. When referring to `Hardware -Recommendations`_ and the `Network Config Reference`_, be cognizant of the -foregoing concepts to understand how Ceph utilizes computing resources. - -.. index:: Ceph Protocol, librados - -Ceph Protocol -============= - -Ceph Clients use the native protocol for interacting with the Ceph Storage -Cluster. Ceph packages this functionality into the ``librados`` library so that -you can create your own custom Ceph Clients. The following diagram depicts the -basic architecture. - -.. ditaa:: - +---------------------------------+ - | Ceph Storage Cluster Protocol | - | (librados) | - +---------------------------------+ - +---------------+ +---------------+ - | OSDs | | Monitors | - +---------------+ +---------------+ - - -Native Protocol and ``librados`` --------------------------------- - -Modern applications need a simple object storage interface with asynchronous -communication capability. The Ceph Storage Cluster provides a simple object -storage interface with asynchronous communication capability. The interface -provides direct, parallel access to objects throughout the cluster. - - -- Pool Operations -- Snapshots and Copy-on-write Cloning -- Read/Write Objects - - Create or Remove - - Entire Object or Byte Range - - Append or Truncate -- Create/Set/Get/Remove XATTRs -- Create/Set/Get/Remove Key/Value Pairs -- Compound operations and dual-ack semantics -- Object Classes - - -.. index:: architecture; watch/notify - -Object Watch/Notify -------------------- - -A client can register a persistent interest with an object and keep a session to -the primary OSD open. The client can send a notification message and a payload to -all watchers and receive notification when the watchers receive the -notification. This enables a client to use any object as a -synchronization/communication channel. - - -.. 
ditaa:: +----------+ +----------+ +----------+ +---------------+ - | Client 1 | | Client 2 | | Client 3 | | OSD:Object ID | - +----------+ +----------+ +----------+ +---------------+ - | | | | - | | | | - | | Watch Object | | - |--------------------------------------------------->| - | | | | - |<---------------------------------------------------| - | | Ack/Commit | | - | | | | - | | Watch Object | | - | |---------------------------------->| - | | | | - | |<----------------------------------| - | | Ack/Commit | | - | | | Watch Object | - | | |----------------->| - | | | | - | | |<-----------------| - | | | Ack/Commit | - | | Notify | | - |--------------------------------------------------->| - | | | | - |<---------------------------------------------------| - | | Notify | | - | | | | - | |<----------------------------------| - | | Notify | | - | | |<-----------------| - | | | Notify | - | | Ack | | - |----------------+---------------------------------->| - | | | | - | | Ack | | - | +---------------------------------->| - | | | | - | | | Ack | - | | |----------------->| - | | | | - |<---------------+----------------+------------------| - | Complete - -.. index:: architecture; Striping - -Data Striping -------------- - -Storage devices have throughput limitations, which impact performance and -scalability. So storage systems often support `striping`_--storing sequential -pieces of information across multiple storage devices--to increase throughput -and performance. The most common form of data striping comes from `RAID`_. -The RAID type most similar to Ceph's striping is `RAID 0`_, or a 'striped -volume'. Ceph's striping offers the throughput of RAID 0 striping, the -reliability of n-way RAID mirroring and faster recovery. - -Ceph provides three types of clients: Ceph Block Device, Ceph Filesystem, and -Ceph Object Storage. A Ceph Client converts its data from the representation -format it provides to its users (a block device image, RESTful objects, CephFS -filesystem directories) into objects for storage in the Ceph Storage Cluster. - -.. tip:: The objects Ceph stores in the Ceph Storage Cluster are not striped. - Ceph Object Storage, Ceph Block Device, and the Ceph Filesystem stripe their - data over multiple Ceph Storage Cluster objects. Ceph Clients that write - directly to the Ceph Storage Cluster via ``librados`` must perform the - striping (and parallel I/O) for themselves to obtain these benefits. - -The simplest Ceph striping format involves a stripe count of 1 object. Ceph -Clients write stripe units to a Ceph Storage Cluster object until the object is -at its maximum capacity, and then create another object for additional stripes -of data. The simplest form of striping may be sufficient for small block device -images, S3 or Swift objects and CephFS files. However, this simple form doesn't -take maximum advantage of Ceph's ability to distribute data across placement -groups, and consequently doesn't improve performance very much. The following -diagram depicts the simplest form of striping: - -.. 
ditaa:: - +---------------+ - | Client Data | - | Format | - | cCCC | - +---------------+ - | - +--------+-------+ - | | - v v - /-----------\ /-----------\ - | Begin cCCC| | Begin cCCC| - | Object 0 | | Object 1 | - +-----------+ +-----------+ - | stripe | | stripe | - | unit 1 | | unit 5 | - +-----------+ +-----------+ - | stripe | | stripe | - | unit 2 | | unit 6 | - +-----------+ +-----------+ - | stripe | | stripe | - | unit 3 | | unit 7 | - +-----------+ +-----------+ - | stripe | | stripe | - | unit 4 | | unit 8 | - +-----------+ +-----------+ - | End cCCC | | End cCCC | - | Object 0 | | Object 1 | - \-----------/ \-----------/ - - -If you anticipate large images sizes, large S3 or Swift objects (e.g., video), -or large CephFS directories, you may see considerable read/write performance -improvements by striping client data over multiple objects within an object set. -Significant write performance occurs when the client writes the stripe units to -their corresponding objects in parallel. Since objects get mapped to different -placement groups and further mapped to different OSDs, each write occurs in -parallel at the maximum write speed. A write to a single disk would be limited -by the head movement (e.g. 6ms per seek) and bandwidth of that one device (e.g. -100MB/s). By spreading that write over multiple objects (which map to different -placement groups and OSDs) Ceph can reduce the number of seeks per drive and -combine the throughput of multiple drives to achieve much faster write (or read) -speeds. - -.. note:: Striping is independent of object replicas. Since CRUSH - replicates objects across OSDs, stripes get replicated automatically. - -In the following diagram, client data gets striped across an object set -(``object set 1`` in the following diagram) consisting of 4 objects, where the -first stripe unit is ``stripe unit 0`` in ``object 0``, and the fourth stripe -unit is ``stripe unit 3`` in ``object 3``. After writing the fourth stripe, the -client determines if the object set is full. If the object set is not full, the -client begins writing a stripe to the first object again (``object 0`` in the -following diagram). If the object set is full, the client creates a new object -set (``object set 2`` in the following diagram), and begins writing to the first -stripe (``stripe unit 16``) in the first object in the new object set (``object -4`` in the diagram below). - -.. 
ditaa:: - +---------------+ - | Client Data | - | Format | - | cCCC | - +---------------+ - | - +-----------------+--------+--------+-----------------+ - | | | | +--\ - v v v v | - /-----------\ /-----------\ /-----------\ /-----------\ | - | Begin cCCC| | Begin cCCC| | Begin cCCC| | Begin cCCC| | - | Object 0 | | Object 1 | | Object 2 | | Object 3 | | - +-----------+ +-----------+ +-----------+ +-----------+ | - | stripe | | stripe | | stripe | | stripe | | - | unit 0 | | unit 1 | | unit 2 | | unit 3 | | - +-----------+ +-----------+ +-----------+ +-----------+ | - | stripe | | stripe | | stripe | | stripe | +-\ - | unit 4 | | unit 5 | | unit 6 | | unit 7 | | Object - +-----------+ +-----------+ +-----------+ +-----------+ +- Set - | stripe | | stripe | | stripe | | stripe | | 1 - | unit 8 | | unit 9 | | unit 10 | | unit 11 | +-/ - +-----------+ +-----------+ +-----------+ +-----------+ | - | stripe | | stripe | | stripe | | stripe | | - | unit 12 | | unit 13 | | unit 14 | | unit 15 | | - +-----------+ +-----------+ +-----------+ +-----------+ | - | End cCCC | | End cCCC | | End cCCC | | End cCCC | | - | Object 0 | | Object 1 | | Object 2 | | Object 3 | | - \-----------/ \-----------/ \-----------/ \-----------/ | - | - +--/ - - +--\ - | - /-----------\ /-----------\ /-----------\ /-----------\ | - | Begin cCCC| | Begin cCCC| | Begin cCCC| | Begin cCCC| | - | Object 4 | | Object 5 | | Object 6 | | Object 7 | | - +-----------+ +-----------+ +-----------+ +-----------+ | - | stripe | | stripe | | stripe | | stripe | | - | unit 16 | | unit 17 | | unit 18 | | unit 19 | | - +-----------+ +-----------+ +-----------+ +-----------+ | - | stripe | | stripe | | stripe | | stripe | +-\ - | unit 20 | | unit 21 | | unit 22 | | unit 23 | | Object - +-----------+ +-----------+ +-----------+ +-----------+ +- Set - | stripe | | stripe | | stripe | | stripe | | 2 - | unit 24 | | unit 25 | | unit 26 | | unit 27 | +-/ - +-----------+ +-----------+ +-----------+ +-----------+ | - | stripe | | stripe | | stripe | | stripe | | - | unit 28 | | unit 29 | | unit 30 | | unit 31 | | - +-----------+ +-----------+ +-----------+ +-----------+ | - | End cCCC | | End cCCC | | End cCCC | | End cCCC | | - | Object 4 | | Object 5 | | Object 6 | | Object 7 | | - \-----------/ \-----------/ \-----------/ \-----------/ | - | - +--/ - -Three important variables determine how Ceph stripes data: - -- **Object Size:** Objects in the Ceph Storage Cluster have a maximum - configurable size (e.g., 2MB, 4MB, etc.). The object size should be large - enough to accommodate many stripe units, and should be a multiple of - the stripe unit. - -- **Stripe Width:** Stripes have a configurable unit size (e.g., 64kb). - The Ceph Client divides the data it will write to objects into equally - sized stripe units, except for the last stripe unit. A stripe width, - should be a fraction of the Object Size so that an object may contain - many stripe units. - -- **Stripe Count:** The Ceph Client writes a sequence of stripe units - over a series of objects determined by the stripe count. The series - of objects is called an object set. After the Ceph Client writes to - the last object in the object set, it returns to the first object in - the object set. - -.. important:: Test the performance of your striping configuration before - putting your cluster into production. You CANNOT change these striping - parameters after you stripe the data and write it to objects. 
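To make the role of these three variables concrete, the sketch below converts a
byte offset into an (object number, offset-within-object) pair following the
object-set layout shown in the diagrams above. The 64 KB stripe unit, 256 KB
object size and stripe count of 4 are arbitrary example values, and the
function is only a model of the layout described here; in practice the Ceph
clients (``librbd``, CephFS, RGW) perform this mapping internally::

    def locate(offset, stripe_unit=64 * 1024, stripe_count=4, object_size=256 * 1024):
        """Return (object number, offset inside that object) for a byte offset."""
        units_per_object = object_size // stripe_unit      # stripe units held by one object
        units_per_set = units_per_object * stripe_count    # stripe units in one object set
        unit = offset // stripe_unit                       # global stripe unit number
        object_set = unit // units_per_set                 # which object set the unit falls in
        unit_in_set = unit % units_per_set
        obj = object_set * stripe_count + unit_in_set % stripe_count  # object number
        row = unit_in_set // stripe_count                  # position of the unit in the object
        return obj, row * stripe_unit + offset % stripe_unit

    # Stripe unit 5 (an offset of 5 x 64 KB) lands in the second slot of object 1:
    print(locate(5 * 64 * 1024))                           # -> (1, 65536)
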
- -Once the Ceph Client has striped data to stripe units and mapped the stripe -units to objects, Ceph's CRUSH algorithm maps the objects to placement groups, -and the placement groups to Ceph OSD Daemons before the objects are stored as -files on a storage disk. - -.. note:: Since a client writes to a single pool, all data striped into objects - get mapped to placement groups in the same pool. So they use the same CRUSH - map and the same access controls. - - -.. index:: architecture; Ceph Clients - -Ceph Clients -============ - -Ceph Clients include a number of service interfaces. These include: - -- **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service - provides resizable, thin-provisioned block devices with snapshotting and - cloning. Ceph stripes a block device across the cluster for high - performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor - that uses ``librbd`` directly--avoiding the kernel object overhead for - virtualized systems. - -- **Object Storage:** The :term:`Ceph Object Storage` (a.k.a., RGW) service - provides RESTful APIs with interfaces that are compatible with Amazon S3 - and OpenStack Swift. - -- **Filesystem**: The :term:`Ceph Filesystem` (CephFS) service provides - a POSIX compliant filesystem usable with ``mount`` or as - a filesytem in user space (FUSE). - -Ceph can run additional instances of OSDs, MDSs, and monitors for scalability -and high availability. The following diagram depicts the high-level -architecture. - -.. ditaa:: - +--------------+ +----------------+ +-------------+ - | Block Device | | Object Storage | | Ceph FS | - +--------------+ +----------------+ +-------------+ - - +--------------+ +----------------+ +-------------+ - | librbd | | librgw | | libcephfs | - +--------------+ +----------------+ +-------------+ - - +---------------------------------------------------+ - | Ceph Storage Cluster Protocol (librados) | - +---------------------------------------------------+ - - +---------------+ +---------------+ +---------------+ - | OSDs | | MDSs | | Monitors | - +---------------+ +---------------+ +---------------+ - - -.. index:: architecture; Ceph Object Storage - -Ceph Object Storage -------------------- - -The Ceph Object Storage daemon, ``radosgw``, is a FastCGI service that provides -a RESTful_ HTTP API to store objects and metadata. It layers on top of the Ceph -Storage Cluster with its own data formats, and maintains its own user database, -authentication, and access control. The RADOS Gateway uses a unified namespace, -which means you can use either the OpenStack Swift-compatible API or the Amazon -S3-compatible API. For example, you can write data using the S3-compatible API -with one application and then read data using the Swift-compatible API with -another application. - -.. topic:: S3/Swift Objects and Store Cluster Objects Compared - - Ceph's Object Storage uses the term *object* to describe the data it stores. - S3 and Swift objects are not the same as the objects that Ceph writes to the - Ceph Storage Cluster. Ceph Object Storage objects are mapped to Ceph Storage - Cluster objects. The S3 and Swift objects do not necessarily - correspond in a 1:1 manner with an object stored in the storage cluster. It - is possible for an S3 or Swift object to map to multiple Ceph objects. - -See `Ceph Object Storage`_ for details. - - -.. 
index:: Ceph Block Device; block device; RBD; Rados Block Device - -Ceph Block Device ------------------ - -A Ceph Block Device stripes a block device image over multiple objects in the -Ceph Storage Cluster, where each object gets mapped to a placement group and -distributed, and the placement groups are spread across separate ``ceph-osd`` -daemons throughout the cluster. - -.. important:: Striping allows RBD block devices to perform better than a single - server could! - -Thin-provisioned snapshottable Ceph Block Devices are an attractive option for -virtualization and cloud computing. In virtual machine scenarios, people -typically deploy a Ceph Block Device with the ``rbd`` network storage driver in -QEMU/KVM, where the host machine uses ``librbd`` to provide a block device -service to the guest. Many cloud computing stacks use ``libvirt`` to integrate -with hypervisors. You can use thin-provisioned Ceph Block Devices with QEMU and -``libvirt`` to support OpenStack and CloudStack among other solutions. - -While we do not provide ``librbd`` support with other hypervisors at this time, -you may also use Ceph Block Device kernel objects to provide a block device to a -client. Other virtualization technologies such as Xen can access the Ceph Block -Device kernel object(s). This is done with the command-line tool ``rbd``. - - -.. index:: Ceph FS; Ceph Filesystem; libcephfs; MDS; metadata server; ceph-mds - -Ceph Filesystem ---------------- - -The Ceph Filesystem (Ceph FS) provides a POSIX-compliant filesystem as a -service that is layered on top of the object-based Ceph Storage Cluster. -Ceph FS files get mapped to objects that Ceph stores in the Ceph Storage -Cluster. Ceph Clients mount a CephFS filesystem as a kernel object or as -a Filesystem in User Space (FUSE). - -.. ditaa:: - +-----------------------+ +------------------------+ - | CephFS Kernel Object | | CephFS FUSE | - +-----------------------+ +------------------------+ - - +---------------------------------------------------+ - | Ceph FS Library (libcephfs) | - +---------------------------------------------------+ - - +---------------------------------------------------+ - | Ceph Storage Cluster Protocol (librados) | - +---------------------------------------------------+ - - +---------------+ +---------------+ +---------------+ - | OSDs | | MDSs | | Monitors | - +---------------+ +---------------+ +---------------+ - - -The Ceph Filesystem service includes the Ceph Metadata Server (MDS) deployed -with the Ceph Storage cluster. The purpose of the MDS is to store all the -filesystem metadata (directories, file ownership, access modes, etc) in -high-availability Ceph Metadata Servers where the metadata resides in memory. -The reason for the MDS (a daemon called ``ceph-mds``) is that simple filesystem -operations like listing a directory or changing a directory (``ls``, ``cd``) -would tax the Ceph OSD Daemons unnecessarily. So separating the metadata from -the data means that the Ceph Filesystem can provide high performance services -without taxing the Ceph Storage Cluster. - -Ceph FS separates the metadata from the data, storing the metadata in the MDS, -and storing the file data in one or more objects in the Ceph Storage Cluster. -The Ceph filesystem aims for POSIX compatibility. ``ceph-mds`` can run as a -single process, or it can be distributed out to multiple physical machines, -either for high availability or for scalability. 
- **High Availability**: The extra ``ceph-mds`` instances can be `standby`,
  ready to take over the duties of any failed ``ceph-mds`` that was
  `active`. This is easy because all the data, including the journal, is
  stored on RADOS. The transition is triggered automatically by ``ceph-mon``.

- **Scalability**: Multiple ``ceph-mds`` instances can be `active`, and they
  will split the directory tree into subtrees (and shards of a single
  busy directory), effectively balancing the load amongst all `active`
  servers.

Combinations of `standby` and `active` daemons are possible, for example
running three `active` ``ceph-mds`` instances for scaling, and one `standby`
instance for high availability.


.. _RADOS - A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters: https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf
.. _Paxos: http://en.wikipedia.org/wiki/Paxos_(computer_science)
.. _Monitor Config Reference: ../rados/configuration/mon-config-ref
.. _Monitoring OSDs and PGs: ../rados/operations/monitoring-osd-pg
.. _Heartbeats: ../rados/configuration/mon-osd-interaction
.. _Monitoring OSDs: ../rados/operations/monitoring-osd-pg/#monitoring-osds
.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
.. _Data Scrubbing: ../rados/configuration/osd-config-ref#scrubbing
.. _Report Peering Failure: ../rados/configuration/mon-osd-interaction#osds-report-peering-failure
.. _Troubleshooting Peering Failure: ../rados/troubleshooting/troubleshooting-pg#placement-group-down-peering-failure
.. _Ceph Authentication and Authorization: ../rados/operations/auth-intro/
.. _Hardware Recommendations: ../start/hardware-recommendations
.. _Network Config Reference: ../rados/configuration/network-config-ref
.. _striping: http://en.wikipedia.org/wiki/Data_striping
.. _RAID: http://en.wikipedia.org/wiki/RAID
.. _RAID 0: http://en.wikipedia.org/wiki/RAID_0#RAID_0
.. _Ceph Object Storage: ../radosgw/
.. _RESTful: http://en.wikipedia.org/wiki/RESTful
.. _Erasure Code Notes: https://github.com/ceph/ceph/blob/40059e12af88267d0da67d8fd8d9cd81244d8f93/doc/dev/osd_internals/erasure_coding/developer_notes.rst
.. _Cache Tiering: ../rados/operations/cache-tiering
.. _Set Pool Values: ../rados/operations/pools#set-pool-values
.. _Kerberos: http://en.wikipedia.org/wiki/Kerberos_(protocol)
.. _Cephx Config Guide: ../rados/configuration/auth-config-ref
.. _User Management: ../rados/operations/user-management