src/ceph/doc/rados/configuration/osd-config-ref.rst

   1 ======================
   2  OSD Config Reference
   3 ======================
   4
   5 .. index:: OSD; configuration
   6
   7 You can configure Ceph OSD Daemons in the Ceph configuration file, but Ceph OSD
   8 Daemons can use the default values and a very minimal configuration. A minimal
   9 Ceph OSD Daemon configuration sets ``osd journal size`` and ``host``,  and
  10 uses default values for nearly everything else.
  11
  12 Ceph OSD Daemons are numerically identified in incremental fashion, beginning
  13 with ``0`` using the following convention. ::
  14
  15         osd.0
  16         osd.1
  17         osd.2
  18
  19 In a configuration file, you may specify settings for all Ceph OSD Daemons in
  20 the cluster by adding configuration settings to the ``[osd]`` section of your
  21 configuration file. To add settings directly to a specific Ceph OSD Daemon
  22 (e.g., ``host``), enter them in an OSD-specific section of your configuration
  23 file. For example:
  24
  25 .. code-block:: ini
  26
  27         [osd]
  28                 osd journal size = 1024
  29
  30         [osd.0]
  31                 host = osd-host-a
  32
  33         [osd.1]
  34                 host = osd-host-b
  35
  36
  37 .. index:: OSD; config settings
  38
  39 General Settings
  40 ================
  41
  42 The following settings provide a Ceph OSD Daemon's ID, and determine paths to
  43 data and journals. Ceph deployment scripts typically generate the UUID
  44 automatically. We **DO NOT** recommend changing the default paths for data or
  45 journals, because doing so makes troubleshooting Ceph more difficult later.
  46
  47 The journal size should be at least twice the product of the expected drive
  48 speed and ``filestore max sync interval``. However, the most common
  49 practice is to partition the journal drive (often an SSD), and mount it such
  50 that Ceph uses the entire partition for the journal.
  51
  52
  53 ``osd uuid``
  54
  55 :Description: The universally unique identifier (UUID) for the Ceph OSD Daemon.
  56 :Type: UUID
  57 :Default: The UUID.
  58 :Note: The ``osd uuid`` applies to a single Ceph OSD Daemon. The ``fsid``
  59        applies to the entire cluster.
  60
  61
  62 ``osd data``
  63
  64 :Description: The path to the OSD's data. You must create the directory when
  65               deploying Ceph. You should mount a drive for OSD data at this
  66               mount point. We do not recommend changing the default.
  67
  68 :Type: String
  69 :Default: ``/var/lib/ceph/osd/$cluster-$id``
  70
  71
  72 ``osd max write size``
  73
  74 :Description: The maximum size of a write in megabytes.
  75 :Type: 32-bit Integer
  76 :Default: ``90``
  77
  78
  79 ``osd client message size cap``
  80
  81 :Description: The largest client data message allowed in memory.
  82 :Type: 64-bit Unsigned Integer
  83 :Default: 500 MB. ``500*1024L*1024L``
  84
  85
  86 ``osd class dir``
  87
  88 :Description: The class path for RADOS class plug-ins.
  89 :Type: String
  90 :Default: ``$libdir/rados-classes``
  91
  92
  93 .. index:: OSD; file system
  94
  95 File System Settings
  96 ====================
  97 Ceph builds and mounts file systems which are used for Ceph OSDs.
  98
  99 ``osd mkfs options {fs-type}``
 100
 101 :Description: Options used when creating a new Ceph OSD of type {fs-type}.
 102
 103 :Type: String
 104 :Default for xfs: ``-f -i 2048``
 105 :Default for other file systems: {empty string}
 106
 107 For example:
 108 ``osd mkfs options xfs = -f -d agcount=24``
 109
 110 ``osd mount options {fs-type}``
 111
 112 :Description: Options used when mounting a Ceph OSD of type {fs-type}.
 113
 114 :Type: String
 115 :Default for xfs: ``rw,noatime,inode64``
 116 :Default for other file systems: ``rw,noatime``
 117
 118 For example:
 119 ``osd mount options xfs = rw,noatime,inode64,logbufs=8``
 120
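     As a sketch of how these two settings might appear together in the ``[osd]``
     section of a configuration file (the option values are based on the examples
     above and are illustrative, not tuning recommendations):

     .. code-block:: ini

             [osd]
                     # Options passed when creating an XFS-backed OSD.
                     osd mkfs options xfs = -f -d agcount=24
                     # Options passed when mounting an XFS-backed OSD.
                     osd mount options xfs = rw,noatime,inode64,logbufs=8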
 121
 122 .. index:: OSD; journal settings
 123
 124 Journal Settings
 125 ================
 126
 127 By default, Ceph expects that you will store a Ceph OSD Daemon's journal with
 128 the following path::
 129
 130         /var/lib/ceph/osd/$cluster-$id/journal
 131
 132 Without performance optimization, Ceph stores the journal on the same disk as
 133 the Ceph OSD Daemon's data. A Ceph OSD Daemon optimized for performance may use
 134 a separate disk to store journal data (e.g., a solid state drive delivers high
 135 performance journaling).
 136
 137 Ceph's default ``osd journal size`` is ``5120`` (MB); set it in your ``ceph.conf``
 138 file if your hardware calls for a different size. To size a journal, find the
 139 product of the ``filestore max sync interval`` and the expected throughput, and
 140 multiply the product by two (2)::
 141
 142         osd journal size = {2 * (expected throughput * filestore max sync interval)}
 143
 144 The expected throughput number should include the expected disk throughput
 145 (i.e., sustained data transfer rate), and network throughput. For example,
 146 a 7200 RPM disk will likely sustain approximately 100 MB/s. Taking the ``min()``
 147 of the disk and network throughput should provide a reasonable expected
 148 throughput. Some users just start off with a 10GB journal size. For
 149 example::
 150
 151         osd journal size = 10000
 152
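     As a rough worked example of the sizing rule above, assume a 100 MB/s disk
     behind a 10 Gb/s network link (roughly 1250 MB/s) and a ``filestore max sync
     interval`` of 5 seconds. These figures are illustrative assumptions, not
     measurements; substitute your own values:

     .. code-block:: ini

             [osd]
                     # expected throughput = min(100 MB/s disk, 1250 MB/s network)
                     #                     = 100 MB/s
                     # journal size = 2 * (100 MB/s * 5 s) = 1000 MB
                     filestore max sync interval = 5
                     osd journal size = 1000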
 153
 154 ``osd journal``
 155
 156 :Description: The path to the OSD's journal. This may be a path to a file or a
 157               block device (such as a partition of an SSD). If it is a file,
 158               you must create the directory to contain it. We recommend using a
 159               drive separate from the ``osd data`` drive.
 160
 161 :Type: String
 162 :Default: ``/var/lib/ceph/osd/$cluster-$id/journal``
 163
 164
 165 ``osd journal size``
 166
 167 :Description: The size of the journal in megabytes. If this is 0, and the
 168               journal is a block device, the entire block device is used.
 169               Since v0.54, this is ignored if the journal is a block device,
 170               and the entire block device is used.
 171
 172 :Type: 32-bit Integer
 173 :Default: ``5120``
 174 :Recommended: Begin with 1GB. Should be at least twice the product of the
 175               expected speed and ``filestore max sync interval``.
 176
 177
 178 See `Journal Config Reference`_ for additional details.
 179
 180
 181 Monitor OSD Interaction
 182 =======================
 183
 184 Ceph OSD Daemons check each other's heartbeats and report to monitors
 185 periodically. Ceph can use default values in many cases. However, if your
 186 network  has latency issues, you may need to adopt longer intervals. See
 187 `Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.
 188
 189
 190 Data Placement
 191 ==============
 192
 193 See `Pool & PG Config Reference`_ for details.
 194
 195
 196 .. index:: OSD; scrubbing
 197
 198 Scrubbing
 199 =========
 200
 201 In addition to making multiple copies of objects, Ceph ensures data integrity by
 202 scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
 203 object storage layer. For each placement group, Ceph generates a catalog of all
 204 objects and compares each primary object and its replicas to ensure that no
 205 objects are missing or mismatched. Light scrubbing (daily) checks the object
 206 size and attributes.  Deep scrubbing (weekly) reads the data and uses checksums
 207 to ensure data integrity.
 208
 209 Scrubbing is important for maintaining data integrity, but it can reduce
 210 performance. You can adjust the following settings to increase or decrease
 211 scrubbing operations.
 212
 213
 214 ``osd max scrubs``
 215
 216 :Description: The maximum number of simultaneous scrub operations for
 217               a Ceph OSD Daemon.
 218
 219 :Type: 32-bit Integer
 220 :Default: ``1``
 221
 222 ``osd scrub begin hour``
 223
 224 :Description: The time of day for the lower bound when a scheduled scrub can be
 225               performed.
 226 :Type: Integer in the range of 0 to 24
 227 :Default: ``0``
 228
 229
 230 ``osd scrub end hour``
 231
 232 :Description: The time of day for the upper bound when a scheduled scrub can be
 233               performed. Together with ``osd scrub begin hour``, it defines a time
 234               window in which scrubs can happen. However, a scrub is performed
 235               regardless of the time window once the placement group's scrub
 236               interval exceeds ``osd scrub max interval``.
 237 :Type: Integer in the range of 0 to 24
 238 :Default: ``24``
 239
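     As a minimal sketch, the following restricts scheduled scrubs to an
     early-morning window. The hours shown are hypothetical; pick a window that
     matches the quiet period of your own cluster:

     .. code-block:: ini

             [osd]
                     # Schedule scrubs only between 01:00 and 06:00.
                     osd scrub begin hour = 1
                     osd scrub end hour = 6

     A placement group whose scrub interval exceeds ``osd scrub max interval`` is
     still scrubbed outside this window.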
 240
 241 ``osd scrub during recovery``
 242
 243 :Description: Allow scrub during recovery. Setting this to ``false`` disables
 244               scheduling new scrubs (and deep scrubs) while there is active recovery.
 245               Scrubs that are already running continue. This might be useful to reduce
 246               load on busy clusters.
 247 :Type: Boolean
 248 :Default: ``true``
 249
 250
 251 ``osd scrub thread timeout``
 252
 253 :Description: The maximum time in seconds before timing out a scrub thread.
 254 :Type: 32-bit Integer
 255 :Default: ``60``
 256
 257
 258 ``osd scrub finalize thread timeout``
 259
 260 :Description: The maximum time in seconds before timing out a scrub finalize
 261               thread.
 262
 263 :Type: 32-bit Integer
 264 :Default: ``60*10``
 265
 266
 267 ``osd scrub load threshold``
 268
 269 :Description: The maximum load. Ceph will not scrub when the system
 270               load (as defined by ``getloadavg()``) is higher than
 271               this number.
 272
 273 :Type: Float
 274 :Default: ``0.5``
 275
 276
 277 ``osd scrub min interval``
 278
 279 :Description: The minimum interval in seconds for scrubbing the Ceph OSD Daemon
 280               when the Ceph Storage Cluster load is low.
 281
 282 :Type: Float
 283 :Default: Once per day. ``60*60*24``
 284
 285
 286 ``osd scrub max interval``
 287
 288 :Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon
 289               irrespective of cluster load.
 290
 291 :Type: Float
 292 :Default: Once per week. ``7*60*60*24``
 293
 294
 295 ``osd scrub chunk min``
 296
 297 :Description: The minimum number of object store chunks to scrub during a single operation.
 298               Ceph blocks writes to a single chunk during scrub.
 299
 300 :Type: 32-bit Integer
 301 :Default: 5
 302
 303
 304 ``osd scrub chunk max``
 305
 306 :Description: The maximum number of object store chunks to scrub during a single operation.
 307
 308 :Type: 32-bit Integer
 309 :Default: 25
 310
 311
 312 ``osd scrub sleep``
 313
 314 :Description: Time in seconds to sleep before scrubbing the next group of chunks. Increasing this
 315               value slows down the whole scrub operation, while client operations are less impacted.
 316
 317 :Type: Float
 318 :Default: 0
 319
 320
 321 ``osd deep scrub interval``
 322
 323 :Description: The interval for "deep" scrubbing (fully reading all data). The
 324               ``osd scrub load threshold`` does not affect this setting.
 325
 326 :Type: Float
 327 :Default: Once per week.  ``60*60*24*7``
 328
 329
 330 ``osd scrub interval randomize ratio``
 331
 332 :Description: Add a random delay to ``osd scrub min interval`` when scheduling
 333               the next scrub job for a placement group. The delay is a random
 334               value less than ``osd scrub min interval`` \*
 335               ``osd scrub interval randomize ratio``. So the default setting
 336               effectively spreads the scrubs out randomly within the allowed time
 337               window of ``[1, 1.5]`` \* ``osd scrub min interval``.
 338 :Type: Float
 339 :Default: ``0.5``
 340
 341 ``osd deep scrub stride``
 342
 343 :Description: Read size when doing a deep scrub.
 344 :Type: 32-bit Integer
 345 :Default: 512 KB. ``524288``
 346
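     To make the arithmetic behind ``osd scrub interval randomize ratio`` concrete,
     here is a sketch that writes out the default values listed above. Setting them
     explicitly is not required; the fragment only illustrates how the window is
     derived:

     .. code-block:: ini

             [osd]
                     # 60*60*24 seconds, i.e. once per day.
                     osd scrub min interval = 86400
                     osd scrub interval randomize ratio = 0.5
                     # The random delay is less than 86400 * 0.5 = 43200 seconds,
                     # so the next scrub is scheduled between 86400 and 129600
                     # seconds (1 to 1.5 days) after the previous one.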
 347
 348 .. index:: OSD; operations settings
 349
 350 Operations
 351 ==========
 352
 353 Operations settings allow you to configure the number of threads for servicing
 354 requests. If you set ``osd op threads`` to ``0``, it disables multi-threading.
 355 By default, Ceph uses two threads with a 15 second timeout and a 30 second
 356 complaint time if an operation doesn't complete within those time parameters.
 357 You can set operations priority weights between client operations and
 358 recovery operations to ensure optimal performance during recovery.
 359
 360
 361 ``osd op threads``
 362
 363 :Description: The number of threads to service Ceph OSD Daemon operations.
 364               Set to ``0`` to disable multi-threading. Increasing the number may increase
 365               the request processing rate.
 366
 367 :Type: 32-bit Integer
 368 :Default: ``2``
 369
 370
 371 ``osd op queue``
 372
 373 :Description: This sets the type of queue to be used for prioritizing ops
 374               in the OSDs. Both queues feature a strict sub-queue which is
 375               dequeued before the normal queue. The normal queue is different
 376               between implementations. The original PrioritizedQueue (``prio``) uses a
 377               token bucket system which, when there are sufficient tokens,
 378               dequeues high priority queues first. If there are not enough tokens
 379               available, queues are dequeued from low priority to high priority.
 380               The WeightedPriorityQueue (``wpq``) dequeues all priorities in
 381               relation to their priorities to prevent starvation of any queue.
 382               WPQ should help in cases where a few OSDs are more overloaded
 383               than others. The new mClock based OpClassQueue
 384               (``mclock_opclass``) prioritizes operations based on which class
 385               they belong to (recovery, scrub, snaptrim, client op, osd subop).
 386               The mClock based ClientQueue (``mclock_client``) also
 387               incorporates the client identifier in order to promote fairness
 388               between clients. See `QoS Based on mClock`_. Requires a restart.
 389
 390 :Type: String
 391 :Valid Choices: prio, wpq, mclock_opclass, mclock_client
 392 :Default: ``prio``
 393
 394
 395 ``osd op queue cut off``
 396
 397 :Description: This selects which priority ops will be sent to the strict
 398               queue versus the normal queue. The ``low`` setting sends all
 399               replication ops and higher to the strict queue, while the ``high``
 400               option sends only replication acknowledgement ops and higher to
 401               the strict queue. Setting this to ``high`` should help when a few
 402               OSDs in the cluster are very busy, especially when combined with
 403               ``wpq`` in the ``osd op queue`` setting. OSDs that are very busy
 404               handling replication traffic could starve primary client traffic
 405               on these OSDs without these settings. Requires a restart.
 406
 407 :Type: String
 408 :Valid Choices: low, high
 409 :Default: ``low``
 410
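     As a sketch of the combination described above for clusters where a few OSDs
     are busier than the rest (whether it helps depends on your workload):

     .. code-block:: ini

             [osd]
                     # Both settings require an OSD restart to take effect.
                     osd op queue = wpq
                     osd op queue cut off = high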
 411
 412 ``osd client op priority``
 413
 414 :Description: The priority set for client operations. It is relative to
 415               ``osd recovery op priority``.
 416
 417 :Type: 32-bit Integer
 418 :Default: ``63``
 419 :Valid Range: 1-63
 420
 421
 422 ``osd recovery op priority``
 423
 424 :Description: The priority set for recovery operations. It is relative to
 425               ``osd client op priority``.
 426
 427 :Type: 32-bit Integer
 428 :Default: ``3``
 429 :Valid Range: 1-63
 430
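     As an illustrative sketch, the following keeps client operations at the top
     of the valid range while lowering the recovery priority. The value ``1`` for
     recovery is a hypothetical choice, not a recommendation:

     .. code-block:: ini

             [osd]
                     # Valid range for both settings is 1-63.
                     osd client op priority = 63
                     osd recovery op priority = 1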
 431
 432 ``osd scrub priority``
 433
 434 :Description: The priority set for scrub operations. It is relative to
 435               ``osd client op priority``.
 436
 437 :Type: 32-bit Integer
 438 :Default: ``5``
 439 :Valid Range: 1-63
 440
 441
 442 ``osd snap trim priority``
 443
 444 :Description: The priority set for snap trim operations. It is relative to
 445               ``osd client op priority``.
 446
 447 :Type: 32-bit Integer
 448 :Default: ``5``
 449 :Valid Range: 1-63
 450
 451
 452 ``osd op thread timeout``
 453
 454 :Description: The Ceph OSD Daemon operation thread timeout in seconds.
 455 :Type: 32-bit Integer
 456 :Default: ``15``
 457
 458
 459 ``osd op complaint time``
 460
 461 :Description: An operation becomes complaint-worthy after the specified number
 462               of seconds have elapsed.
 463
 464 :Type: Float
 465 :Default: ``30``
 466
 467
 468 ``osd disk threads``
 469
 470 :Description: The number of disk threads, which are used to perform background
 471               disk intensive OSD operations such as scrubbing and snap
 472               trimming.
 473
 474 :Type: 32-bit Integer
 475 :Default: ``1``
 476
 477 ``osd disk thread ioprio class``
 478
 479 :Description: Warning: it will only be used if both ``osd disk thread
 480               ioprio class`` and ``osd disk thread ioprio priority`` are
 481               set to a non default value.  Sets the ioprio_set(2) I/O
 482               scheduling ``class`` for the disk thread. Acceptable
 483               values are ``idle``, ``be`` or ``rt``. The ``idle``
 484               class means the disk thread will have lower priority
 485               than any other thread in the OSD. This is useful to slow
 486               down scrubbing on an OSD that is busy handling client
 487               operations. ``be`` is the default and is the same
 488               priority as all other threads in the OSD. ``rt`` means
 489               the disk thread will have precedence over all other
 490               threads in the OSD. Note: Only works with the Linux Kernel
 491               CFQ scheduler. Since Jewel, scrubbing is no longer carried
 492               out by the disk iothread; see the OSD priority options instead.
 493 :Type: String
 494 :Default: the empty string
 495
 496 ``osd disk thread ioprio priority``
 497
 498 :Description: Warning: it will only be used if both ``osd disk thread
 499               ioprio class`` and ``osd disk thread ioprio priority`` are
 500               set to a non default value. It sets the ioprio_set(2)
 501               I/O scheduling ``priority`` of the disk thread ranging
 502               from 0 (highest) to 7 (lowest). If all OSDs on a given
 503               host were in class ``idle`` and compete for I/O
 504               (e.g., due to controller congestion), it can be used to
 505               lower the disk thread priority of one OSD to 7 so that
 506               another OSD with priority 0 can have priority.
 507               Note: Only works with the Linux Kernel CFQ scheduler.
 508 :Type: Integer in the range of 0 to 7 or -1 if not to be used.
 509 :Default: ``-1``
 510
 511 ``osd op history size``
 512
 513 :Description: The maximum number of completed operations to track.
 514 :Type: 32-bit Unsigned Integer
 515 :Default: ``20``
 516
 517
 518 ``osd op history duration``
 519
 520 :Description: The age, in seconds, of the oldest completed operation to track.
 521 :Type: 32-bit Unsigned Integer
 522 :Default: ``600``
 523
 524
 525 ``osd op log threshold``
 526
 527 :Description: How many operation logs to display at once.
 528 :Type: 32-bit Integer
 529 :Default: ``5``
 530
 531
 532 QoS Based on mClock
 533 -------------------
 534
 535 Ceph's use of mClock is currently in the experimental phase and should
 536 be approached with an exploratory mindset.
 537
 538 Core Concepts
 539 `````````````
 540
 541 The QoS support of Ceph is implemented using a queueing scheduler
 542 based on `the dmClock algorithm`_. This algorithm allocates the I/O
 543 resources of the Ceph cluster in proportion to weights, and enforces
 544 the constraints of minimum reservation and maximum limitation, so that
 545 the services can compete for the resources fairly. Currently the
 546 *mclock_opclass* operation queue divides Ceph services involving I/O
 547 resources into the following buckets:
 548
 549 - client op: the iops issued by a client
 550 - osd subop: the iops issued by the primary OSD
 551 - snap trim: the snap trimming related requests
 552 - pg recovery: the recovery related requests
 553 - pg scrub: the scrub related requests
 554
 555 The resources are partitioned using the following three sets of tags. In other
 556 words, the share of each type of service is controlled by three tags:
 557
 558 #. reservation: the minimum IOPS allocated for the service.
 559 #. limitation: the maximum IOPS allocated for the service.
 560 #. weight: the proportional share of capacity if there is extra capacity or the
 561    system is oversubscribed.
 562
 563 In Ceph, operations are graded with a "cost", and the resources allocated
 564 for serving various services are consumed by these "costs". So, for
 565 example, the more reservation a service has, the more resources it is
 566 guaranteed to possess, as long as it requires them. Assume there are 2
 567 services, recovery and client ops:
 568
 569 - recovery: (r:1, l:5, w:1)
 570 - client ops: (r:2, l:0, w:9)
 571
 572 The settings above ensure that recovery won't get more than 5
 573 requests per second serviced, even if it requires more (see CURRENT
 574 IMPLEMENTATION NOTE below) and no other services are competing with
 575 it. Conversely, if the clients start to issue a large number of I/O
 576 requests, they will not exhaust all the I/O resources either: 1 request
 577 per second is always allocated for recovery jobs as long as there are any
 578 such requests, so the recovery jobs won't be starved even in a cluster with
 579 high load. In the meantime, the client ops can enjoy a larger
 580 portion of the I/O resource, because their weight is "9" while their
 581 competitor's is "1". The client ops are not clamped by the
 582 limit setting, so they can make use of all the resources if there is no
 583 recovery ongoing.
 584
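     The two-service example above could be expressed with the ``osd op queue
     mclock`` options documented later in this section. This is a sketch under
     the assumption that the (r, l, w) tuples map directly onto the ``res``,
     ``lim`` and ``wgt`` options, and that a limit of ``0`` means "no limit", as
     the text above implies:

     .. code-block:: ini

             [osd]
                     osd op queue = mclock_opclass

                     # recovery: (r:1, l:5, w:1)
                     osd op queue mclock recov res = 1.0
                     osd op queue mclock recov lim = 5.0
                     osd op queue mclock recov wgt = 1.0

                     # client ops: (r:2, l:0, w:9)
                     osd op queue mclock client op res = 2.0
                     osd op queue mclock client op lim = 0.0
                     osd op queue mclock client op wgt = 9.0
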
 585 Along with *mclock_opclass* another mclock operation queue named
 586 *mclock_client* is available. It divides operations based on category
 587 and also divides them based on the client making the request. This
 588 not only helps manage the distribution of resources spent on different
 589 classes of operations but also tries to ensure fairness among clients.
 590
 591 CURRENT IMPLEMENTATION NOTE: the current experimental implementation
 592 does not enforce the limit values. As a first approximation we decided
 593 not to prevent operations that would otherwise enter the operation
 594 sequencer from doing so.
 595
 596 Subtleties of mClock
 597 ````````````````````
 598
 599 The reservation and limit values have a unit of requests per
 600 second. The weight, however, does not technically have a unit and the
 601 weights are relative to one another. So if one class of requests has a
 602 weight of 1 and another a weight of 9, then the latter class of
 603 requests should get executed at a 9 to 1 ratio relative to the first class.
 604 However that will only happen once the reservations are met and those
 605 values include the operations executed under the reservation phase.
 606
 607 Even though the weights do not have units, one must be careful in
 608 choosing their values due to how the algorithm assigns weight tags to
 609 requests. If the weight is *W*, then for a given class of requests,
 610 the next one that comes in will have a weight tag of *1/W* plus the
 611 previous weight tag or the current time, whichever is larger. That
 612 means if *W* is sufficiently large and therefore *1/W* is sufficiently
 613 small, the calculated tag may never be assigned as it will get a value
 614 of the current time. The ultimate lesson is that values for weight
 615 should not be too large. They should be under the number of requests
 616 one expects to be serviced each second.
 617
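     The tag assignment described above can be sketched as a formula. The symbols
     are notation introduced here for illustration, not names from the Ceph code:
     :math:`T_n` is the weight tag of the *n*-th request in a class,
     :math:`t_{\mathrm{now}}` is the current time, and *W* is the class weight.

     .. math::

             T_n = \max\left(T_{n-1} + \frac{1}{W},\ t_{\mathrm{now}}\right)

     When *1/W* is very small, :math:`T_{n-1} + 1/W` tends to fall behind
     :math:`t_{\mathrm{now}}`, so the tag collapses to the current time and the
     weight no longer separates the classes.
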
 618 Caveats
 619 ```````
 620
 621 There are some factors that can reduce the impact of the mClock op
 622 queues within Ceph. First, requests to an OSD are sharded by their
 623 placement group identifier. Each shard has its own mClock queue and
 624 these queues neither interact nor share information among them. The
 625 number of shards can be controlled with the configuration options
 626 ``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
 627 ``osd_op_num_shards_ssd``. A lower number of shards will increase the
 628 impact of the mClock queues, but may have other deleterious effects.
 629
 630 Second, requests are transferred from the operation queue to the
 631 operation sequencer, in which they go through the phases of
 632 execution. The operation queue is where mClock resides and mClock
 633 determines the next op to transfer to the operation sequencer. The
 634 number of operations allowed in the operation sequencer is a complex
 635 issue. In general we want to keep enough operations in the sequencer
 636 so it's always getting work done on some operations while it's waiting
 637 for disk and network access to complete on other operations. On the
 638 other hand, once an operation is transferred to the operation
 639 sequencer, mClock no longer has control over it. Therefore to maximize
 640 the impact of mClock, we want to keep as few operations in the
 641 operation sequencer as possible. So we have an inherent tension.
 642
 643 The configuration options that influence the number of operations in
 644 the operation sequencer are ``bluestore_throttle_bytes``,
 645 ``bluestore_throttle_deferred_bytes``,
 646 ``bluestore_throttle_cost_per_io``,
 647 ``bluestore_throttle_cost_per_io_hdd``, and
 648 ``bluestore_throttle_cost_per_io_ssd``.
 649
 650 A third factor that affects the impact of the mClock algorithm is that
 651 we're using a distributed system, where requests are made to multiple
 652 OSDs and each OSD has (can have) multiple shards. Yet we're currently
 653 using the mClock algorithm, which is not distributed (note: dmClock is
 654 the distributed version of mClock).
 655
 656 Various organizations and individuals are currently experimenting with
 657 mClock as it exists in this code base along with their modifications
 658 to the code base. We hope you'll share your experiences with your
 659 mClock and dmClock experiments in the ceph-devel mailing list.
 660
 661
 662 ``osd push per object cost``
 663
 664 :Description: the overhead for serving a push op
 665
 666 :Type: Unsigned Integer
 667 :Default: 1000
 668
 669 ``osd recovery max chunk``
 670
 671 :Description: the maximum total size of data chunks a recovery op can carry.
 672
 673 :Type: Unsigned Integer
 674 :Default: 8 MiB
 675
 676
 677 ``osd op queue mclock client op res``
 678
 679 :Description: the reservation of client op.
 680
 681 :Type: Float
 682 :Default: 1000.0
 683
 684
 685 ``osd op queue mclock client op wgt``
 686
 687 :Description: the weight of client op.
 688
 689 :Type: Float
 690 :Default: 500.0
 691
 692
 693 ``osd op queue mclock client op lim``
 694
 695 :Description: the limit of client op.
 696
 697 :Type: Float
 698 :Default: 1000.0
 699
 700
 701 ``osd op queue mclock osd subop res``
 702
 703 :Description: the reservation of osd subop.
 704
 705 :Type: Float
 706 :Default: 1000.0
 707
 708
 709 ``osd op queue mclock osd subop wgt``
 710
 711 :Description: the weight of osd subop.
 712
 713 :Type: Float
 714 :Default: 500.0
 715
 716
 717 ``osd op queue mclock osd subop lim``
 718
 719 :Description: the limit of osd subop.
 720
 721 :Type: Float
 722 :Default: 0.0
 723
 724
 725 ``osd op queue mclock snap res``
 726
 727 :Description: the reservation of snap trimming.
 728
 729 :Type: Float
 730 :Default: 0.0
 731
 732
 733 ``osd op queue mclock snap wgt``
 734
 735 :Description: the weight of snap trimming.
 736
 737 :Type: Float
 738 :Default: 1.0
 739
 740
 741 ``osd op queue mclock snap lim``
 742
 743 :Description: the limit of snap trimming.
 744
 745 :Type: Float
 746 :Default: 0.001
 747
 748
 749 ``osd op queue mclock recov res``
 750
 751 :Description: the reservation of recovery.
 752
 753 :Type: Float
 754 :Default: 0.0
 755
 756
 757 ``osd op queue mclock recov wgt``
 758
 759 :Description: the weight of recovery.
 760
 761 :Type: Float
 762 :Default: 1.0
 763
 764
 765 ``osd op queue mclock recov lim``
 766
 767 :Description: the limit of recovery.
 768
 769 :Type: Float
 770 :Default: 0.001
 771
 772
 773 ``osd op queue mclock scrub res``
 774
 775 :Description: the reservation of scrub jobs.
 776
 777 :Type: Float
 778 :Default: 0.0
 779
 780
 781 ``osd op queue mclock scrub wgt``
 782
 783 :Description: the weight of scrub jobs.
 784
 785 :Type: Float
 786 :Default: 1.0
 787
 788
 789 ``osd op queue mclock scrub lim``
 790
 791 :Description: the limit of scrub jobs.
 792
 793 :Type: Float
 794 :Default: 0.001
 795
 796 .. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf
 797
 798
 799 .. index:: OSD; backfilling
 800
 801 Backfilling
 802 ===========
 803
 804 When you add Ceph OSD Daemons to or remove them from a cluster, the CRUSH algorithm will
 805 want to rebalance the cluster by moving placement groups to or from Ceph OSD
 806 Daemons to restore the balance. The process of migrating placement groups and
 807 the objects they contain can reduce the cluster's operational performance
 808 considerably. To maintain operational performance, Ceph performs this migration
 809 with 'backfilling', which allows Ceph to set backfill operations to a lower
 810 priority than requests to read or write data.
 811
 812
 813 ``osd max backfills``
 814
 815 :Description: The maximum number of backfills allowed to or from a single OSD.
 816 :Type: 64-bit Unsigned Integer
 817 :Default: ``1``
 818
 819
 820 ``osd backfill scan min``
 821
 822 :Description: The minimum number of objects per backfill scan.
 823
 824 :Type: 32-bit Integer
 825 :Default: ``64``
 826
 827
 828 ``osd backfill scan max``
 829
 830 :Description: The maximum number of objects per backfill scan.
 831
 832 :Type: 32-bit Integer
 833 :Default: ``512``
 834
 835
 836 ``osd backfill retry interval``
 837
 838 :Description: The number of seconds to wait before retrying backfill requests.
 839 :Type: Double
 840 :Default: ``10.0``
 841
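     As a hypothetical starting point, the fragment below allows slightly more
     concurrent backfills than the default while leaving the scan sizes and retry
     interval at the defaults listed above. The value ``2`` is an assumption for
     illustration; more backfills generally mean more load on the affected OSDs:

     .. code-block:: ini

             [osd]
                     # Default is 1.
                     osd max backfills = 2
                     # Defaults, written out explicitly.
                     osd backfill scan min = 64
                     osd backfill scan max = 512
                     osd backfill retry interval = 10.0
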
 842 .. index:: OSD; osdmap
 843
 844 OSD Map
 845 =======
 846
 847 OSD maps reflect the OSD daemons operating in the cluster. Over time, the
 848 number of map epochs increases. Ceph provides some settings to ensure that
 849 Ceph performs well as the OSD map grows larger.
 850
 851
 852 ``osd map dedup``
 853
 854 :Description: Enable removing duplicates in the OSD map.
 855 :Type: Boolean
 856 :Default: ``true``
 857
 858
 859 ``osd map cache size``
 860
 861 :Description: The number of OSD maps to keep cached.
 862 :Type: 32-bit Integer
 863 :Default: ``500``
 864
 865
 866 ``osd map cache bl size``
 867
 868 :Description: The size of the in-memory OSD map cache in OSD daemons.
 869 :Type: 32-bit Integer
 870 :Default: ``50``
 871
 872
 873 ``osd map cache bl inc size``
 874
 875 :Description: The size of the in-memory OSD map cache incrementals in
 876               OSD daemons.
 877
 878 :Type: 32-bit Integer
 879 :Default: ``100``
 880
 881
 882 ``osd map message max``
 883
 884 :Description: The maximum map entries allowed per MOSDMap message.
 885 :Type: 32-bit Integer
 886 :Default: ``100``
 887
 888
 889
 890 .. index:: OSD; recovery
 891
 892 Recovery
 893 ========
 894
 895 When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
 896 begins peering with other Ceph OSD Daemons before writes can occur.  See
 897 `Monitoring OSDs and PGs`_ for details.
 898
 899 If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
 900 sync with other Ceph OSD Daemons containing more recent versions of objects in
 901 the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
 902 mode and seeks to get the latest copy of the data and bring its map back up to
 903 date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
 904 and placement groups may be significantly out of date. Also, if a failure domain
 905 went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
 906 the same time. This can make the recovery process time-consuming and
 907 resource-intensive.
 908
 909 To maintain operational performance, Ceph limits the number of recovery
 910 requests, threads, and object chunk sizes during recovery, which allows Ceph
 911 to perform well in a degraded state.
 912
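For example, an operator who wants recovery to yield more to client I/O might
lower the number of concurrent recovery requests and add a short pause between
recovery operations. The fragment below is a minimal sketch using options
described in this section; the values are illustrative assumptions, not
recommendations:

.. code-block:: ini

        [osd]
                # Illustrative values only (assumed for this example).
                osd recovery max active = 1
                osd recovery sleep hdd = 0.2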
 913
 914 ``osd recovery delay start``
 915
 916 :Description: After peering completes, Ceph will delay for the specified number
 917               of seconds before starting to recover objects.
 918
 919 :Type: Float
 920 :Default: ``0``
 921
 922
 923 ``osd recovery max active``
 924
 925 :Description: The number of active recovery requests per OSD at one time. More
 926               requests will accelerate recovery, but the requests place an
 927               increased load on the cluster.
 928
 929 :Type: 32-bit Integer
 930 :Default: ``3``
 931
 932
 933 ``osd recovery max chunk``
 934
 935 :Description: The maximum size, in bytes, of a recovered chunk of data to push.
 936 :Type: 64-bit Unsigned Integer
 937 :Default: ``8 << 20`` (8 MiB)
 938
 939
 940 ``osd recovery max single start``
 941
 942 :Description: The maximum number of recovery operations per OSD that will be
 943               newly started when an OSD is recovering.
 944 :Type: 64-bit Unsigned Integer
 945 :Default: ``1``
 946
 947
 948 ``osd recovery thread timeout``
 949
 950 :Description: The maximum time in seconds before timing out a recovery thread.
 951 :Type: 32-bit Integer
 952 :Default: ``30``
 953
 954
 955 ``osd recover clone overlap``
 956
 957 :Description: Preserves clone overlap during recovery. Should always be set
 958               to ``true``.
 959
 960 :Type: Boolean
 961 :Default: ``true``
 962
 963
 964 ``osd recovery sleep``
 965
 966 :Description: Time in seconds to sleep before the next recovery or backfill op.
 967               Increasing this value slows recovery but reduces the impact
 968               on client operations.
 969
 970 :Type: Float
 971 :Default: ``0``
 972
 973
 974 ``osd recovery sleep hdd``
 975
 976 :Description: Time in seconds to sleep before the next recovery or backfill op
 977               for HDDs.
 978
 979 :Type: Float
 980 :Default: ``0.1``
 981
 982
 983 ``osd recovery sleep ssd``
 984
 985 :Description: Time in seconds to sleep before the next recovery or backfill op
 986               for SSDs.
 987
 988 :Type: Float
 989 :Default: ``0``
 990
 991
 992 ``osd recovery sleep hybrid``
 993
 994 :Description: Time in seconds to sleep before the next recovery or backfill op
 995               when OSD data is on an HDD and the OSD journal is on an SSD.
 996
 997 :Type: Float
 998 :Default: ``0.025``
 999
1000 Tiering
1001 =======
1002
1003 ``osd agent max ops``
1004
1005 :Description: The maximum number of simultaneous flushing ops per tiering agent
1006               in the high speed mode.
1007 :Type: 32-bit Integer
1008 :Default: ``4``
1009
1010
1011 ``osd agent max low ops``
1012
1013 :Description: The maximum number of simultaneous flushing ops per tiering agent
1014               in the low speed mode.
1015 :Type: 32-bit Integer
1016 :Default: ``2``
1017
1018 See `cache target dirty high ratio`_ for details on when the tiering agent
1019 flushes dirty objects in high speed mode.
1020
1021 Miscellaneous
1022 =============
1023
1024
1025 ``osd snap trim thread timeout``
1026
1027 :Description: The maximum time in seconds before timing out a snap trim thread.
1028 :Type: 32-bit Integer
1029 :Default: ``60*60*1`` (3600 seconds, or one hour)
1030
1031
1032 ``osd backlog thread timeout``
1033
1034 :Description: The maximum time in seconds before timing out a backlog thread.
1035 :Type: 32-bit Integer
1036 :Default: ``60*60*1`` (3600 seconds, or one hour)
1037
1038
1039 ``osd default notify timeout``
1040
1041 :Description: The OSD default notification timeout (in seconds).
1042 :Type: 32-bit Unsigned Integer
1043 :Default: ``30``
1044
1045
1046 ``osd check for log corruption``
1047
1048 :Description: Check log files for corruption. Can be computationally expensive.
1049 :Type: Boolean
1050 :Default: ``false``
1051
1052
1053 ``osd remove thread timeout``
1054
1055 :Description: The maximum time in seconds before timing out a remove OSD thread.
1056 :Type: 32-bit Integer
1057 :Default: ``60*60`` (3600 seconds, or one hour)
1058
1059
1060 ``osd command thread timeout``
1061
1062 :Description: The maximum time in seconds before timing out a command thread.
1063 :Type: 32-bit Integer
1064 :Default: ``10*60`` (600 seconds, or ten minutes)
1065
1066
1067 ``osd command max records``
1068
1069 :Description: Limits the number of lost objects to return.
1070 :Type: 32-bit Integer
1071 :Default: ``256``
1072
1073
1074 ``osd auto upgrade tmap``
1075
1076 :Description: Uses ``tmap`` for ``omap`` on old objects.
1077 :Type: Boolean
1078 :Default: ``true``
1079
1080
1081 ``osd tmapput sets users tmap``
1082
1083 :Description: Uses ``tmap`` for debugging only.
1084 :Type: Boolean
1085 :Default: ``false``
1086
1087
1088 ``osd fast fail on connection refused``
1089
1090 :Description: If this option is enabled, crashed OSDs are marked down
1091               immediately by connected peers and MONs (assuming that the
1092               crashed OSD host survives). Disable it to restore the old
1093               behavior, at the expense of possible long I/O stalls when
1094               OSDs crash in the middle of I/O operations.
1095 :Type: Boolean
1096 :Default: ``true``
1097
1098
1099
1100 .. _pool: ../../operations/pools
1101 .. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
1102 .. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
1103 .. _Pool & PG Config Reference: ../pool-pg-config-ref
1104 .. _Journal Config Reference: ../journal-ref
1105 .. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio