=====================================
 Configuring Monitor/OSD Interaction
=====================================

.. index:: heartbeat

After you have completed your initial Ceph configuration, you may deploy and run
Ceph. When you execute a command such as ``ceph health`` or ``ceph -s``, the
:term:`Ceph Monitor` reports on the current state of the :term:`Ceph Storage
Cluster`. The Ceph Monitor learns the state of the Ceph Storage Cluster by
requiring reports from each :term:`Ceph OSD Daemon`, and by receiving reports
from Ceph OSD Daemons about the status of their neighboring Ceph OSD Daemons. If
the Ceph Monitor doesn't receive reports, or if it receives reports of changes
in the Ceph Storage Cluster, the Ceph Monitor updates the status of the
:term:`Ceph Cluster Map`.

Ceph provides reasonable default settings for Ceph Monitor/Ceph OSD Daemon
interaction. However, you may override the defaults. The following sections
describe how Ceph Monitors and Ceph OSD Daemons interact for the purposes of
monitoring the Ceph Storage Cluster.

.. index:: heartbeat interval

OSDs Check Heartbeats
=====================

Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons every 6
seconds. You can change the heartbeat interval by adding an ``osd heartbeat
interval`` setting under the ``[osd]`` section of your Ceph configuration file,
or by setting the value at runtime. If a neighboring Ceph OSD Daemon doesn't
show a heartbeat within a 20-second grace period, the Ceph OSD Daemon may
consider the neighboring Ceph OSD Daemon ``down`` and report it back to a Ceph
Monitor, which will update the Ceph Cluster Map. You may change this grace
period by adding an ``osd heartbeat grace`` setting in both the ``[mon]`` and
``[osd]`` sections, or in the ``[global]`` section, of your Ceph configuration
file, or by setting the value at runtime.
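
The heartbeat settings described above might appear in a Ceph configuration
file as in the following sketch. The values ``10`` and ``30`` are illustrative
assumptions rather than recommendations; the defaults are ``6`` and ``20``.

.. code-block:: ini

    [osd]
    # Check peer heartbeats every 10 seconds (illustrative; default is 6).
    osd heartbeat interval = 10

    [global]
    # Placed in [global] so that both the monitors and the OSDs read it.
    # 30 seconds is an illustrative value (default is 20).
    osd heartbeat grace = 30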


.. ditaa:: +---------+          +---------+
           |  OSD 1  |          |  OSD 2  |
           +---------+          +---------+
                |                    |
                |----+ Heartbeat     |
                |    | Interval      |
                |<---+ Exceeded      |
                |                    |
                |       Check        |
                |     Heartbeat      |
                |------------------->|
                |                    |
                |<-------------------|
                |   Heart Beating    |
                |                    |
                |----+ Heartbeat     |
                |    | Interval      |
                |<---+ Exceeded      |
                |                    |
                |       Check        |
                |     Heartbeat      |
                |------------------->|
                |                    |
                |----+ Grace         |
                |    | Period        |
                |<---+ Exceeded      |
                |                    |
                |----+ Mark          |
                |    | OSD 2         |
                |<---+ Down          |


.. index:: OSD down report

OSDs Report Down OSDs
=====================

By default, two Ceph OSD Daemons from different hosts must report to the Ceph
Monitors that another Ceph OSD Daemon is ``down`` before the Ceph Monitors
acknowledge that the reported Ceph OSD Daemon is ``down``. However, there is a
chance that all the OSDs reporting the failure are hosted in a rack with a bad
switch that has trouble connecting to another OSD. To avoid this sort of false
alarm, Ceph treats the peers reporting a failure as a proxy for a potential
"subcluster" of the overall cluster that is similarly laggy. This is not true in
all cases, but it sometimes helps localize the grace correction to a subset of
the system that is unhappy. The ``mon osd reporter subtree level`` setting
groups the peers into the "subcluster" by their common ancestor type in the
CRUSH map. By default, only two reports from different subtrees are required to
report another Ceph OSD Daemon ``down``. You can change the number of reporters
from unique subtrees and the common ancestor type required to report a Ceph OSD
Daemon ``down`` to a Ceph Monitor by adding the ``mon osd min down reporters``
and ``mon osd reporter subtree level`` settings under the ``[mon]`` section of
your Ceph configuration file, or by setting the values at runtime.
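
As a minimal sketch of the two settings named above, the following fragment
requires reports from three distinct racks before an OSD is marked ``down``.
It assumes that your CRUSH map defines ``rack`` buckets; the value ``3`` is an
illustrative assumption (the defaults are ``2`` and ``host``).

.. code-block:: ini

    [mon]
    # Require failure reports from three unique subtrees (default is 2).
    mon osd min down reporters = 3
    # Count reporters per rack instead of per host (default is host).
    # Assumes the CRUSH map contains buckets of type "rack".
    mon osd reporter subtree level = rack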


.. ditaa:: +---------+     +---------+      +---------+
           |  OSD 1  |     |  OSD 2  |      | Monitor |
           +---------+     +---------+      +---------+
                |               |                |
                | OSD 3 Is Down |                |
                |---------------+--------------->|
                |               |                |
                |               |                |
                |               | OSD 3 Is Down  |
                |               |--------------->|
                |               |                |
                |               |                |
                |               |                |---------+ Mark
                |               |                |         | OSD 3
                |               |                |<--------+ Down


.. index:: peering failure

OSDs Report Peering Failure
===========================

If a Ceph OSD Daemon cannot peer with any of the Ceph OSD Daemons defined in its
Ceph configuration file (or the cluster map), it will ping a Ceph Monitor for
the most recent copy of the cluster map every 30 seconds. You can change the
Ceph Monitor heartbeat interval by adding an ``osd mon heartbeat interval``
setting under the ``[osd]`` section of your Ceph configuration file, or by
setting the value at runtime.
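
For illustration, the interval described above could be overridden as in the
following sketch. The value ``60`` is an assumption chosen for the example; the
default is ``30``.

.. code-block:: ini

    [osd]
    # Ping a monitor for a fresh cluster map every 60 seconds when the OSD
    # has no peers (illustrative; default is 30).
    osd mon heartbeat interval = 60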

.. ditaa:: +---------+     +---------+     +-------+     +---------+
           |  OSD 1  |     |  OSD 2  |     | OSD 3 |     | Monitor |
           +---------+     +---------+     +-------+     +---------+
                |               |              |              |
                |  Request To   |              |              |
                |     Peer      |              |              |
                |-------------->|              |              |
                |<--------------|              |              |
                |    Peering                   |              |
                |                              |              |
                |  Request To                  |              |
                |     Peer                     |              |
                |----------------------------->|              |
                |                                             |
                |----+ OSD Monitor                            |
                |    | Heartbeat                              |
                |<---+ Interval Exceeded                      |
                |                                             |
                |         Failed to Peer with OSD 3           |
                |-------------------------------------------->|
                |<--------------------------------------------|
                |          Receive New Cluster Map            |


.. index:: OSD status

OSDs Report Their Status
========================

If a Ceph OSD Daemon doesn't report to a Ceph Monitor, the Ceph Monitor will
consider the Ceph OSD Daemon ``down`` after the ``mon osd report timeout``
elapses. A Ceph OSD Daemon sends a report to a Ceph Monitor within 5 seconds of
a reportable event, such as a failure, a change in placement group stats, a
change in ``up_thru``, or booting. You can change the Ceph OSD Daemon minimum
report interval by adding an ``osd mon report interval min`` setting under the
``[osd]`` section of your Ceph configuration file, or by setting the value at
runtime. A Ceph OSD Daemon also sends a report to a Ceph Monitor every 120
seconds, irrespective of whether any notable changes occur. You can change this
maximum report interval by adding an ``osd mon report interval max`` setting
under the ``[osd]`` section of your Ceph configuration file, or by setting the
value at runtime.
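
The reporting settings described above can be sketched together as follows.
The interval values are illustrative assumptions (the defaults are ``5`` and
``120``); note that the minimum interval should stay below the maximum
interval, and the monitor-side timeout is shown at its default of ``900``.

.. code-block:: ini

    [osd]
    # Wait at least 10 seconds after a reportable event before reporting
    # (illustrative; default is 5).
    osd mon report interval min = 10
    # Report at least every 180 seconds even if nothing changed
    # (illustrative; default is 120).
    osd mon report interval max = 180

    [mon]
    # Mark an OSD down if it has not reported within 900 seconds (default).
    mon osd report timeout = 900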


.. ditaa:: +---------+          +---------+
           |  OSD 1  |          | Monitor |
           +---------+          +---------+
                |                    |
                |----+ Report Min    |
                |    | Interval      |
                |<---+ Exceeded      |
                |                    |
                |----+ Reportable    |
                |    | Event         |
                |<---+ Occurs        |
                |                    |
                |     Report To      |
                |      Monitor       |
                |------------------->|
                |                    |
                |----+ Report Max    |
                |    | Interval      |
                |<---+ Exceeded      |
                |                    |
                |     Report To      |
                |      Monitor       |
                |------------------->|
                |                    |
                |----+ Monitor       |
                |    | Fails         |
                |<---+               |
                                     +----+ Monitor OSD
                                     |    | Report Timeout
                                     |<---+ Exceeded
                                     |
                                     +----+ Mark
                                     |    | OSD 1
                                     |<---+ Down


Configuration Settings
======================

When modifying heartbeat settings, you should include them in the ``[global]``
section of your configuration file.

.. index:: monitor heartbeat

Monitor Settings
----------------

``mon osd min up ratio``

:Description: The minimum ratio of ``up`` Ceph OSD Daemons before Ceph will
              mark Ceph OSD Daemons ``down``.

:Type: Double
:Default: ``.3``


``mon osd min in ratio``

:Description: The minimum ratio of ``in`` Ceph OSD Daemons before Ceph will
              mark Ceph OSD Daemons ``out``.

:Type: Double
:Default: ``.75``


``mon osd laggy halflife``

:Description: The half-life, in seconds, over which laggy estimates decay.
:Type: Integer
:Default: ``60*60``


``mon osd laggy weight``

:Description: The weight for new samples in laggy estimation decay.
:Type: Double
:Default: ``0.3``


``mon osd laggy max interval``

:Description: Maximum value of ``laggy_interval`` in laggy estimations (in seconds).
              The monitor uses an adaptive approach to evaluate the
              ``laggy_interval`` of a given OSD. This value is used to calculate
              the grace time for that OSD.
:Type: Integer
:Default: ``300``


``mon osd adjust heartbeat grace``

:Description: If set to ``true``, Ceph will scale the heartbeat grace period
              based on laggy estimations.
:Type: Boolean
:Default: ``true``


``mon osd adjust down out interval``

:Description: If set to ``true``, Ceph will scale the down/out interval based
              on laggy estimations.
:Type: Boolean
:Default: ``true``


``mon osd auto mark in``

:Description: Ceph will mark any booting Ceph OSD Daemons as ``in``
              the Ceph Storage Cluster.

:Type: Boolean
:Default: ``false``


``mon osd auto mark auto out in``

:Description: Ceph will mark booting Ceph OSD Daemons auto marked ``out``
              of the Ceph Storage Cluster as ``in`` the cluster.

:Type: Boolean
:Default: ``true``


``mon osd auto mark new in``

:Description: Ceph will mark booting new Ceph OSD Daemons as ``in`` the
              Ceph Storage Cluster.

:Type: Boolean
:Default: ``true``


``mon osd down out interval``

:Description: The number of seconds Ceph waits before marking a Ceph OSD Daemon
              ``down`` and ``out`` if it doesn't respond.

:Type: 32-bit Integer
:Default: ``600``


``mon osd down out subtree limit``

:Description: The smallest :term:`CRUSH` unit type that Ceph will **not**
              automatically mark out. For instance, if set to ``host`` and if
              all OSDs of a host are down, Ceph will not automatically mark out
              these OSDs.

:Type: String
:Default: ``rack``


``mon osd report timeout``

:Description: The grace period in seconds before declaring
              unresponsive Ceph OSD Daemons ``down``.

:Type: 32-bit Integer
:Default: ``900``


``mon osd min down reporters``

:Description: The minimum number of Ceph OSD Daemons required to report a
              ``down`` Ceph OSD Daemon.

:Type: 32-bit Integer
:Default: ``2``


``mon osd reporter subtree level``

:Description: The level of parent bucket at which the reporters are counted.
              OSDs send failure reports to the monitor if they find that a peer
              is not responsive. The monitor marks the reported OSD ``down``,
              and then ``out`` after a grace period.
:Type: String
:Default: ``host``
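
Several of the monitor settings listed above might be combined as in the
following sketch. The chosen values are illustrative assumptions, not
recommendations, and the ``host`` limit assumes that your CRUSH map uses
``host`` buckets.

.. code-block:: ini

    [mon]
    # Wait 15 minutes before marking an unresponsive OSD out
    # (illustrative; default is 600 seconds).
    mon osd down out interval = 900
    # Do not automatically mark out all OSDs of a host that is down
    # (illustrative; default is rack).
    mon osd down out subtree limit = host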


.. index:: OSD heartbeat

OSD Settings
------------

``osd heartbeat address``

:Description: A Ceph OSD Daemon's network address for heartbeats.
:Type: Address
:Default: The host address.


``osd heartbeat interval``

:Description: How often a Ceph OSD Daemon pings its peers (in seconds).
:Type: 32-bit Integer
:Default: ``6``


``osd heartbeat grace``

:Description: The elapsed time, in seconds, without a heartbeat from a Ceph OSD
              Daemon after which the Ceph Storage Cluster considers it ``down``.
              This setting must be set in both the ``[mon]`` and ``[osd]``
              sections, or in the ``[global]`` section, so that it is read by
              both the MON and OSD daemons.
:Type: 32-bit Integer
:Default: ``20``


``osd mon heartbeat interval``

:Description: How often the Ceph OSD Daemon pings a Ceph Monitor if it has no
              Ceph OSD Daemon peers.

:Type: 32-bit Integer
:Default: ``30``


``osd mon report interval max``

:Description: The maximum time in seconds that a Ceph OSD Daemon can wait before
              it must report to a Ceph Monitor.

:Type: 32-bit Integer
:Default: ``120``


``osd mon report interval min``

:Description: The minimum number of seconds a Ceph OSD Daemon may wait
              from startup or another reportable event before reporting
              to a Ceph Monitor.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: Should be less than ``osd mon report interval max``


``osd mon ack timeout``

:Description: The number of seconds to wait for a Ceph Monitor to acknowledge a
              request for statistics.

:Type: 32-bit Integer
:Default: ``30``