src/ceph/doc/rados/operations/monitoring.rst

======================
 Monitoring a Cluster
======================

Once you have a running cluster, you may use the ``ceph`` tool to monitor your
cluster. Monitoring a cluster typically involves checking OSD status, monitor
status, placement group status and metadata server status.

Using the command line
======================

Interactive mode
----------------

To run the ``ceph`` tool in interactive mode, type ``ceph`` at the command line
with no arguments.  For example::

        ceph
        ceph> health
        ceph> status
        ceph> quorum_status
        ceph> mon_status

Non-default paths
-----------------

If you specified non-default locations for your configuration or keyring,
you may specify their locations::

   ceph -c /path/to/conf -k /path/to/keyring health

Checking a Cluster's Status
===========================

After you start your cluster, and before you start reading and/or
writing data, check your cluster's status first.

To check a cluster's status, execute the following::

        ceph status

Or::

        ceph -s

In interactive mode, type ``status`` and press **Enter**. ::

        ceph> status

Ceph will print the cluster status. For example, a tiny Ceph demonstration
cluster with one of each service may print the following:

::

  cluster:
    id:     477e46f1-ae41-4e43-9c8f-72c918ab0a20
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum a
    mgr: x(active)
    mds: 1/1/1 up {0=a=up:active}
    osd: 1 osds: 1 up, 1 in

  data:
    pools:   2 pools, 16 pgs
    objects: 21 objects, 2246 bytes
    usage:   546 GB used, 384 GB / 931 GB avail
    pgs:     16 active+clean


.. topic:: How Ceph Calculates Data Usage

   The ``usage`` value reflects the *actual* amount of raw storage used. The
   ``xxx GB / xxx GB`` value shows the amount available (the lesser number)
   out of the overall storage capacity of the cluster (the greater number).
   The notional number reflects the size of the stored data before it is
   replicated, cloned or snapshotted. Therefore, the amount of data actually
   stored typically exceeds the notional amount stored, because Ceph creates
   replicas of the data and may also use storage capacity for cloning and
   snapshotting.
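
As a rough illustration of the relationship described in the topic above, the
following sketch estimates raw usage from a notional size and a replica count.
The function name ``raw_usage_mb`` and the replica count of 3 are assumptions
chosen for this example; clones, snapshots and erasure coding are ignored.

.. code-block:: shell

   # Hypothetical helper: estimate raw usage (in MB) for a replicated pool.
   # $1 = notional data size in MB, $2 = number of replicas.
   raw_usage_mb() {
       echo $(( $1 * $2 ))
   }

   # 100 MB of notional data stored with 3 replicas uses about 300 MB raw.
   raw_usage_mb 100 3   # prints 300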


Watching a Cluster
==================

In addition to local logging by each daemon, Ceph clusters maintain
a *cluster log* that records high-level events about the whole system.
This is logged to disk on monitor servers (as ``/var/log/ceph/ceph.log`` by
default), but can also be monitored via the command line.

To follow the cluster log, use the following command:

::

        ceph -w

Ceph will print the status of the system, followed by each log message as it
is emitted.  For example:

::

  cluster:
    id:     477e46f1-ae41-4e43-9c8f-72c918ab0a20
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum a
    mgr: x(active)
    mds: 1/1/1 up {0=a=up:active}
    osd: 1 osds: 1 up, 1 in

  data:
    pools:   2 pools, 16 pgs
    objects: 21 objects, 2246 bytes
    usage:   546 GB used, 384 GB / 931 GB avail
    pgs:     16 active+clean


  2017-07-24 08:15:11.329298 mon.a mon.0 172.21.9.34:6789/0 23 : cluster [INF] osd.0 172.21.9.34:6806/20527 boot
  2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x
  2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available


In addition to using ``ceph -w`` to print log lines as they are emitted,
use ``ceph log last [n]`` to see the most recent ``n`` lines from the cluster
log.
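
Each cluster log line carries a severity tag such as ``[INF]`` or ``[WRN]``,
as the sample output above shows. The following is a minimal sketch of a filter
that keeps only warnings. The function name ``only_warnings`` is hypothetical,
and the tag format is assumed from the samples in this document.

.. code-block:: shell

   # Hypothetical filter: keep only cluster log lines tagged [WRN].
   only_warnings() {
       grep -F '[WRN]'
   }

   # Example with two log lines in the format shown in this document.
   printf '%s\n' \
       '2017-07-24 08:15:11.329298 mon.a mon.0 172.21.9.34:6789/0 23 : cluster [INF] osd.0 172.21.9.34:6806/20527 boot' \
       '2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)' \
       | only_warnings

On a live cluster, the same filter could be applied as
``ceph log last 50 | only_warnings``.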

Monitoring Health Checks
========================

Ceph continuously runs various *health checks* against its own status.  When
a health check fails, this is reflected in the output of ``ceph status`` (or
``ceph health``).  In addition, messages are sent to the cluster log to
indicate when a check fails, and when the cluster recovers.

For example, when an OSD goes down, the ``health`` section of the status
output may be updated as follows:

::

    health: HEALTH_WARN
            1 osds down
            Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded

At the same time, cluster log messages are emitted to record the failure of
the health checks:

::

    2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
    2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED)

When the OSD comes back online, the cluster log records the cluster's return
to a healthy state:

::

    2017-07-25 10:11:11.526841 mon.a mon.0 172.21.9.34:6789/0 109 : cluster [WRN] Health check update: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized (PG_DEGRADED)
    2017-07-25 10:11:13.535493 mon.a mon.0 172.21.9.34:6789/0 110 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized)
    2017-07-25 10:11:13.535577 mon.a mon.0 172.21.9.34:6789/0 111 : cluster [INF] Cluster is now healthy
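
For scripted monitoring, the first word printed by ``ceph health`` can be
mapped to an exit code. The following is a minimal sketch. The function name
``health_exit_code`` and the exit-code convention (0, 1, 2) are assumptions
made for this example and are not part of Ceph.

.. code-block:: shell

   # Hypothetical helper: translate a health string into an exit code.
   health_exit_code() {
       # $1 is the health string, e.g. the first word of `ceph health` output.
       case "$1" in
           HEALTH_OK)   return 0 ;;
           HEALTH_WARN) return 1 ;;
           *)           return 2 ;;   # HEALTH_ERR or anything unexpected
       esac
   }

   # On a live cluster this might be driven by:
   #   health_exit_code "$(ceph health | awk '{print $1}')"
   health_exit_code HEALTH_OK && echo "cluster is healthy"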


Detecting configuration issues
==============================

In addition to the health checks that Ceph continuously runs on its
own status, there are some configuration issues that may only be detected
by an external tool.

Use the `ceph-medic`_ tool to run these additional checks on your Ceph
cluster's configuration.

Checking a Cluster's Usage Stats
================================

To check a cluster's data usage and data distribution among pools, you can
use the ``df`` command. It is similar to the Linux ``df`` command. Execute
the following::

        ceph df

The **GLOBAL** section of the output provides an overview of the amount of
storage your cluster uses for your data.

- **SIZE:** The overall storage capacity of the cluster.
- **AVAIL:** The amount of free space available in the cluster.
- **RAW USED:** The amount of raw storage used.
- **% RAW USED:** The percentage of raw storage used. Use this number in
  conjunction with the ``full ratio`` and ``near full ratio`` to ensure that
  you are not reaching your cluster's capacity. See `Storage Capacity`_ for
  additional details.

The **POOLS** section of the output provides a list of pools and the notional
usage of each pool. The output from this section **DOES NOT** reflect replicas,
clones or snapshots. For example, if you store an object with 1 MB of data, the
notional usage will be 1 MB, but the actual usage may be 2 MB or more depending
on the number of replicas, clones and snapshots.

- **NAME:** The name of the pool.
- **ID:** The pool ID.
- **USED:** The notional amount of data stored, in kilobytes unless the number
  is suffixed with **M** for megabytes or **G** for gigabytes.
- **%USED:** The notional percentage of storage used per pool.
- **MAX AVAIL:** An estimate of the notional amount of data that can be written
  to this pool.
- **OBJECTS:** The notional number of objects stored per pool.

.. note:: The numbers in the **POOLS** section are notional. They are not
   inclusive of the number of replicas, snapshots or clones. As a result,
   the sum of the **USED** and **%USED** amounts will not add up to the
   **RAW USED** and **% RAW USED** amounts in the **GLOBAL** section of the
   output.

.. note:: The **MAX AVAIL** value is a complicated function of the
   replication or erasure code used, the CRUSH rule that maps storage
   to devices, the utilization of those devices, and the configured
   ``mon_osd_full_ratio``.
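
To illustrate how **% RAW USED** might be compared with the ``near full ratio``
and ``full ratio`` mentioned above, the sketch below classifies a percentage
against two thresholds. The function name ``capacity_state`` is hypothetical,
and the thresholds of 85 and 95 percent are assumptions for this example;
check the values configured for your own cluster.

.. code-block:: shell

   # Hypothetical helper: classify a usage percentage against two thresholds.
   # $1 = percent used, $2 = near full threshold, $3 = full threshold.
   capacity_state() {
       used=${1%.*}          # drop any fractional part, e.g. 58.65 -> 58
       nearfull=$2
       full=$3
       if [ "$used" -ge "$full" ]; then
           echo full
       elif [ "$used" -ge "$nearfull" ]; then
           echo nearfull
       else
           echo ok
       fi
   }

   capacity_state 58.65 85 95   # prints "ok"
   capacity_state 87.10 85 95   # prints "nearfull"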



Checking OSD Status
===================

You can check OSDs to ensure they are ``up`` and ``in`` by executing::

        ceph osd stat

Or::

        ceph osd dump

You can also view OSDs according to their position in the CRUSH map. ::

        ceph osd tree

Ceph will print a CRUSH tree showing each host, its OSDs, whether they are up,
and their weights. ::

        # id    weight  type name       up/down reweight
        -1      3       pool default
        -3      3               rack mainrack
        -2      3                       host osd-host
        0       1                               osd.0   up      1
        1       1                               osd.1   up      1
        2       1                               osd.2   up      1

For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_.
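
The tree output can also be filtered in a script. The sketch below lists OSDs
whose status field is ``down``. It assumes that the OSD name is immediately
followed by its status, as in the sample above; the exact layout differs
between Ceph releases. The function name ``down_osds`` is hypothetical.

.. code-block:: shell

   # Hypothetical filter: print every osd.N entry that is followed by "down".
   down_osds() {
       awk '{ for (i = 1; i < NF; i++)
                  if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down") print $i }'
   }

   # Example with two rows in the layout shown above.
   printf '%s\n' \
       '0       1                               osd.0   up      1' \
       '1       1                               osd.1   down    1' \
       | down_osds   # prints osd.1

On a live cluster, this could be used as ``ceph osd tree | down_osds``.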

Checking Monitor Status
=======================

If your cluster has multiple monitors, which is likely, check the monitor
quorum status after you start the cluster and before you read or write data. A
quorum must be present when multiple monitors are running. You should also check
monitor status periodically to ensure that the monitors are running.
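
A quorum requires a majority of the monitors, that is, more than half of them.
The arithmetic can be sketched as follows; the function name ``quorum_size`` is
hypothetical.

.. code-block:: shell

   # Hypothetical helper: smallest majority of $1 monitors.
   quorum_size() {
       echo $(( $1 / 2 + 1 ))
   }

   quorum_size 3   # prints 2
   quorum_size 5   # prints 3

Under this rule, a cluster with three monitors keeps its quorum when one
monitor fails, and a cluster with five monitors tolerates two failures.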

To display the monitor map, execute the following::

        ceph mon stat

Or::

        ceph mon dump

To check the quorum status for the monitor cluster, execute the following::

        ceph quorum_status

Ceph will return the quorum status. For example, a Ceph cluster consisting of
three monitors may return the following:

.. code-block:: javascript

        { "election_epoch": 10,
          "quorum": [
                0,
                1,
                2],
          "monmap": { "epoch": 1,
              "fsid": "444b489c-4f16-4b75-83f0-cb8097468898",
              "modified": "2011-12-12 13:28:27.505520",
              "created": "2011-12-12 13:28:27.505520",
              "mons": [
                    { "rank": 0,
                      "name": "a",
                      "addr": "127.0.0.1:6789\/0"},
                    { "rank": 1,
                      "name": "b",
                      "addr": "127.0.0.1:6790\/0"},
                    { "rank": 2,
                      "name": "c",
                      "addr": "127.0.0.1:6791\/0"}
                   ]
            }
        }

Checking MDS Status
===================

Metadata servers provide metadata services for Ceph FS. Metadata servers have
two sets of states: ``up | down`` and ``active | inactive``. To ensure your
metadata servers are ``up`` and ``active``, execute the following::

        ceph mds stat

To display details of the metadata cluster, execute the following::

        ceph fs dump


Checking Placement Group States
===============================

Placement groups map objects to OSDs. When you monitor your
placement groups, you will want them to be ``active`` and ``clean``.
For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_.

.. _Monitoring OSDs and Placement Groups: ../monitoring-osd-pg


Using the Admin Socket
======================

The Ceph admin socket allows you to query a daemon via a socket interface.
By default, Ceph sockets reside under ``/var/run/ceph``. To access a daemon
via the admin socket, log in to the host running the daemon and use one of the
following commands::

        ceph daemon {daemon-name}
        ceph daemon {path-to-socket-file}

For example, the following are equivalent::

    ceph daemon osd.0 foo
    ceph daemon /var/run/ceph/ceph-osd.0.asok foo

To view the available admin socket commands, execute the following command::

        ceph daemon {daemon-name} help

The admin socket command enables you to show and set your configuration at
runtime. See `Viewing a Configuration at Runtime`_ for details.

Additionally, you can set configuration values at runtime directly through the
admin socket. The admin socket bypasses the monitor, unlike
``ceph tell {daemon-type}.{id} injectargs``, which relies on the monitor but
does not require you to log in directly to the host in question.
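
The two forms of the ``ceph daemon`` command shown earlier differ only in how
the socket is located. As a sketch, the hypothetical helper ``asok_path`` below
builds the socket path from a daemon name. It assumes the default
``/var/run/ceph`` directory and the ``ceph-{daemon-name}.asok`` naming seen in
the example above; a cluster with different socket settings would need a
different path.

.. code-block:: shell

   # Hypothetical helper: build the default admin socket path for a daemon.
   # $1 = daemon name such as osd.0 or mon.a
   asok_path() {
       echo "/var/run/ceph/ceph-$1.asok"
   }

   asok_path osd.0   # prints /var/run/ceph/ceph-osd.0.asok

With it, ``ceph daemon "$(asok_path osd.0)" help`` addresses the same daemon as
``ceph daemon osd.0 help``.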

.. _Viewing a Configuration at Runtime: ../../configuration/ceph-conf#ceph-runtime-config
.. _Storage Capacity: ../../configuration/mon-config-ref#storage-capacity
.. _ceph-medic: http://docs.ceph.com/ceph-medic/master/