======================
 Monitoring a Cluster
======================

Once you have a running cluster, you may use the ``ceph`` tool to monitor your
cluster. Monitoring a cluster typically involves checking OSD status, monitor
status, placement group status, and metadata server status.

Using the command line
======================

Interactive mode
----------------

To run the ``ceph`` tool in interactive mode, type ``ceph`` at the command line
with no arguments. For example::

    ceph
    ceph> health
    ceph> status
    ceph> quorum_status
    ceph> mon_status

Non-default paths
-----------------

If you specified non-default locations for your configuration or keyring,
you may specify their locations::

    ceph -c /path/to/conf -k /path/to/keyring health

Checking a Cluster's Status
===========================

After you start your cluster, and before you start reading and/or
writing data, check your cluster's status first.

To check a cluster's status, execute the following::

    ceph status

Or::

    ceph -s

In interactive mode, type ``status`` and press **Enter**. ::

    ceph> status

Ceph will print the cluster status. For example, a tiny Ceph demonstration
cluster with one of each service may print the following:

::

  cluster:
    id:     477e46f1-ae41-4e43-9c8f-72c918ab0a20
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum a
    mgr: x(active)
    mds: 1/1/1 up {0=a=up:active}
    osd: 1 osds: 1 up, 1 in

  data:
    pools:   2 pools, 16 pgs
    objects: 21 objects, 2246 bytes
    usage:   546 GB used, 384 GB / 931 GB avail
    pgs:     16 active+clean


.. topic:: How Ceph Calculates Data Usage

   The ``usage`` value reflects the *actual* amount of raw storage used. The
   ``xxx GB / xxx GB`` value shows the amount of space available (the lesser
   number) out of the overall storage capacity of the cluster. The notional
   number reflects the size of the stored data before it is replicated, cloned
   or snapshotted. Therefore, the amount of data actually stored typically
   exceeds the notional amount stored, because Ceph creates replicas of the
   data and may also use storage capacity for cloning and snapshotting.
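The status report is also available in a machine-readable form, which is more
convenient when the check is run from a script or an external monitoring
system rather than read by eye. A minimal sketch (``jq`` is assumed to be
installed and is not part of Ceph; the exact JSON field layout can vary
between releases)::

    # full status as JSON, suitable for feeding into other tooling
    ceph status --format json-pretty

    # extract just the overall health state (HEALTH_OK, HEALTH_WARN, HEALTH_ERR)
    ceph status --format json | jq -r '.health.status'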
Watching a Cluster
==================

In addition to local logging by each daemon, Ceph clusters maintain
a *cluster log* that records high level events about the whole system.
This is logged to disk on monitor servers (as ``/var/log/ceph/ceph.log`` by
default), but can also be monitored via the command line.

To follow the cluster log, use the following command:

::

    ceph -w

Ceph will print the status of the system, followed by each log message as it
is emitted. For example:

::

  cluster:
    id:     477e46f1-ae41-4e43-9c8f-72c918ab0a20
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum a
    mgr: x(active)
    mds: 1/1/1 up {0=a=up:active}
    osd: 1 osds: 1 up, 1 in

  data:
    pools:   2 pools, 16 pgs
    objects: 21 objects, 2246 bytes
    usage:   546 GB used, 384 GB / 931 GB avail
    pgs:     16 active+clean


  2017-07-24 08:15:11.329298 mon.a mon.0 172.21.9.34:6789/0 23 : cluster [INF] osd.0 172.21.9.34:6806/20527 boot
  2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x
  2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available


In addition to using ``ceph -w`` to print log lines as they are emitted,
use ``ceph log last [n]`` to see the most recent ``n`` lines from the cluster
log.

Monitoring Health Checks
========================

Ceph continuously runs various *health checks* against its own status. When
a health check fails, this is reflected in the output of ``ceph status`` (or
``ceph health``). In addition, messages are sent to the cluster log to
indicate when a check fails, and when the cluster recovers.

For example, when an OSD goes down, the ``health`` section of the status
output may be updated as follows:

::

    health: HEALTH_WARN
            1 osds down
            Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded

At this time, cluster log messages are also emitted to record the failure of
the health checks:

::

    2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
    2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED)

When the OSD comes back online, the cluster log records the cluster's return
to a healthy state:

::

    2017-07-25 10:11:11.526841 mon.a mon.0 172.21.9.34:6789/0 109 : cluster [WRN] Health check update: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized (PG_DEGRADED)
    2017-07-25 10:11:13.535493 mon.a mon.0 172.21.9.34:6789/0 110 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized)
    2017-07-25 10:11:13.535577 mon.a mon.0 172.21.9.34:6789/0 111 : cluster [INF] Cluster is now healthy


Detecting Configuration Issues
==============================

In addition to the health checks that Ceph continuously runs on its
own status, there are some configuration issues that may only be detected
by an external tool.

Use the `ceph-medic`_ tool to run these additional checks on your Ceph
cluster's configuration.

Checking a Cluster's Usage Stats
================================

To check a cluster's data usage and data distribution among pools, you can
use the ``df`` option. It is similar to the Linux ``df`` command. Execute
the following::

    ceph df

The **GLOBAL** section of the output provides an overview of the amount of
storage your cluster uses for your data.

- **SIZE:** The overall storage capacity of the cluster.
- **AVAIL:** The amount of free space available in the cluster.
- **RAW USED:** The amount of raw storage used.
- **% RAW USED:** The percentage of raw storage used. Use this number in
  conjunction with the ``full ratio`` and ``near full ratio`` to ensure that
  you are not reaching your cluster's capacity. See `Storage Capacity`_ for
  additional details.

The **POOLS** section of the output provides a list of pools and the notional
usage of each pool. The output from this section **DOES NOT** reflect replicas,
clones or snapshots. For example, if you store an object with 1MB of data, the
notional usage will be 1MB, but the actual usage may be 2MB or more depending
on the number of replicas, clones and snapshots.

- **NAME:** The name of the pool.
- **ID:** The pool ID.
- **USED:** The notional amount of data stored in kilobytes, unless the number
  appends **M** for megabytes or **G** for gigabytes.
- **%USED:** The notional percentage of storage used per pool.
- **MAX AVAIL:** An estimate of the notional amount of data that can be written
  to this pool.
- **OBJECTS:** The notional number of objects stored per pool.

.. note:: The numbers in the **POOLS** section are notional. They are not
   inclusive of the number of replicas, snapshots or clones. As a result,
   the sum of the **USED** and **%USED** amounts will not add up to the
   **RAW USED** and **%RAW USED** amounts in the **GLOBAL** section of the
   output.

.. note:: The **MAX AVAIL** value is a complicated function of the
   replication or erasure code used, the CRUSH rule that maps storage
   to devices, the utilization of those devices, and the configured
   ``mon_osd_full_ratio``.
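If you consume these statistics from a monitoring script rather than reading
them at the command line, the same report is available in other forms. A small
sketch (``ceph df detail`` adds further per-pool columns, and the ``--format``
option works with most ``ceph`` commands)::

    # the usage report with additional per-pool detail
    ceph df detail

    # machine-readable output for scripts and external monitoring systems
    ceph df --format json-pretty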
Checking OSD Status
===================

You can check OSDs to ensure they are ``up`` and ``in`` by executing::

    ceph osd stat

Or::

    ceph osd dump

You can also view OSDs according to their position in the CRUSH map. ::

    ceph osd tree

Ceph will print out a CRUSH tree with a host, its OSDs, whether they are up,
and their weight. ::

    # id    weight  type name       up/down reweight
    -1      3       pool default
    -3      3               rack mainrack
    -2      3                       host osd-host
    0       1                               osd.0   up      1
    1       1                               osd.1   up      1
    2       1                               osd.2   up      1

For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_.

Checking Monitor Status
=======================

If your cluster has multiple monitors (likely), you should check the monitor
quorum status after you start the cluster and before reading and/or writing
data. A quorum must be present when multiple monitors are running. You should
also check monitor status periodically to ensure that the monitors are running.

To display the monitor map, execute the following::

    ceph mon stat

Or::

    ceph mon dump

To check the quorum status for the monitor cluster, execute the following::

    ceph quorum_status

Ceph will return the quorum status. For example, a Ceph cluster consisting of
three monitors may return the following:

.. code-block:: javascript

    { "election_epoch": 10,
      "quorum": [
            0,
            1,
            2],
      "monmap": { "epoch": 1,
          "fsid": "444b489c-4f16-4b75-83f0-cb8097468898",
          "modified": "2011-12-12 13:28:27.505520",
          "created": "2011-12-12 13:28:27.505520",
          "mons": [
                { "rank": 0,
                  "name": "a",
                  "addr": "127.0.0.1:6789\/0"},
                { "rank": 1,
                  "name": "b",
                  "addr": "127.0.0.1:6790\/0"},
                { "rank": 2,
                  "name": "c",
                  "addr": "127.0.0.1:6791\/0"}
               ]
        }
    }
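When this check is scripted, it is usually the quorum membership that matters
rather than the full map. A hedged sketch (``jq`` is assumed to be installed;
the ``quorum_names`` and ``quorum_leader_name`` fields appear in recent
releases but are not shown in the older sample output above)::

    # list the monitors currently in quorum
    ceph quorum_status | jq -r '.quorum_names[]'

    # show which monitor is currently the leader
    ceph quorum_status | jq -r '.quorum_leader_name'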
Checking MDS Status
===================

Metadata servers provide metadata services for Ceph FS. Metadata servers have
two sets of states: ``up | down`` and ``active | inactive``. To ensure your
metadata servers are ``up`` and ``active``, execute the following::

    ceph mds stat

To display details of the metadata cluster, execute the following::

    ceph fs dump


Checking Placement Group States
===============================

Placement groups map objects to OSDs. When you monitor your
placement groups, you will want them to be ``active`` and ``clean``.
For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_.

.. _Monitoring OSDs and Placement Groups: ../monitoring-osd-pg


Using the Admin Socket
======================

The Ceph admin socket allows you to query a daemon via a socket interface.
By default, Ceph sockets reside under ``/var/run/ceph``. To access a daemon
via the admin socket, log in to the host running the daemon and use the
following command::

    ceph daemon {daemon-name}
    ceph daemon {path-to-socket-file}

For example, the following are equivalent::

    ceph daemon osd.0 foo
    ceph daemon /var/run/ceph/ceph-osd.0.asok foo

To view the available admin socket commands, execute the following command::

    ceph daemon {daemon-name} help

The admin socket commands enable you to show and set your configuration at
runtime. See `Viewing a Configuration at Runtime`_ for details.

Additionally, you can set configuration values at runtime directly. The admin
socket bypasses the monitor, unlike ``ceph tell {daemon-type}.{id} injectargs``,
which relies on the monitor but does not require you to log in directly to the
host in question.

.. _Viewing a Configuration at Runtime: ../../configuration/ceph-conf#ceph-runtime-config
.. _Storage Capacity: ../../configuration/mon-config-ref#storage-capacity
.. _ceph-medic: http://docs.ceph.com/ceph-medic/master/
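As a concrete illustration, a few admin socket queries that are often useful
when monitoring a single daemon (a sketch; replace ``osd.0`` and ``mon.a`` with
daemons that actually run on the local host)::

    # show the daemon's current configuration
    ceph daemon osd.0 config show

    # dump the daemon's internal performance counters
    ceph daemon osd.0 perf dump

    # ask a monitor for its own view of its status and the quorum
    ceph daemon mon.a mon_status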