======================
 OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file, but Ceph OSD
Daemons can use the default values and a very minimal configuration. A minimal
Ceph OSD Daemon configuration sets ``osd journal size`` and ``host``, and
uses default values for nearly everything else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0``, using the following convention::

    osd.0
    osd.1
    osd.2

In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding configuration settings to the ``[osd]`` section of your
configuration file. To add settings directly to a specific Ceph OSD Daemon
(e.g., ``host``), enter the setting in an OSD-specific section of your
configuration file. For example:

.. code-block:: ini

    [osd]
        osd journal size = 1024

    [osd.0]
        host = osd-host-a

    [osd.1]
        host = osd-host-b


.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID and determine paths to
data and journals. Ceph deployment scripts typically generate the UUID
automatically. We **DO NOT** recommend changing the default paths for data or
journals, as doing so makes it more problematic to troubleshoot Ceph later.

The journal size should be at least twice the product of the expected drive
speed and ``filestore max sync interval``. However, the most common practice is
to partition the journal drive (often an SSD), and mount it such that Ceph uses
the entire partition for the journal.


``osd uuid``

:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon.
:Type: UUID
:Default: The UUID.
:Note: The ``osd uuid`` applies to a single Ceph OSD Daemon. The ``fsid``
       applies to the entire cluster.


``osd data``

:Description: The path to the OSD's data. You must create the directory when
              deploying Ceph. You should mount a drive for OSD data at this
              mount point. We do not recommend changing the default.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id``


``osd max write size``

:Description: The maximum size of a write in megabytes.
:Type: 32-bit Integer
:Default: ``90``


``osd client message size cap``

:Description: The largest client data message allowed in memory.
:Type: 64-bit Unsigned Integer
:Default: 500 MB. ``500*1024L*1024L``


``osd class dir``

:Description: The class path for RADOS class plug-ins.
:Type: String
:Default: ``$libdir/rados-classes``


.. index:: OSD; file system

File System Settings
====================
Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd mkfs options {fs-type}``

:Description: Options used when creating a new Ceph OSD of type {fs-type}.
:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

    osd mkfs options xfs = -f -d agcount=24


``osd mount options {fs-type}``

:Description: Options used when mounting a Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw, noatime``

For example::

    osd mount options xfs = rw, noatime, inode64, logbufs=8


.. index:: OSD; journal settings

Journal Settings
================

By default, Ceph expects that you will store a Ceph OSD Daemon's journal at
the following path::

    /var/lib/ceph/osd/$cluster-$id/journal

Without performance optimization, Ceph stores the journal on the same disk as
the Ceph OSD Daemon's data. A Ceph OSD Daemon optimized for performance may use
a separate disk to store journal data (e.g., a solid state drive delivers high
performance journaling).

Ceph's default ``osd journal size`` is 0, so you will need to set this in your
``ceph.conf`` file. To size the journal, take the product of ``filestore max
sync interval`` and the expected throughput, and multiply that product by
two (2)::

    osd journal size = {2 * (expected throughput * filestore max sync interval)}

The expected throughput number should include the expected disk throughput
(i.e., sustained data transfer rate) and network throughput. For example,
a 7200 RPM disk will likely deliver approximately 100 MB/s. Taking the ``min()``
of the disk and network throughput should provide a reasonable expected
throughput. Some users just start off with a 10GB journal size. For
example::

    osd journal size = 10000


``osd journal``

:Description: The path to the OSD's journal. This may be a path to a file or a
              block device (such as a partition of an SSD). If it is a file,
              you must create the directory to contain it. We recommend using a
              drive separate from the ``osd data`` drive.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id/journal``


``osd journal size``

:Description: The size of the journal in megabytes. If this is 0, and the
              journal is a block device, the entire block device is used.
              Since v0.54, this is ignored if the journal is a block device,
              and the entire block device is used.

:Type: 32-bit Integer
:Default: ``5120``
:Recommended: Begin with 1GB. Should be at least twice the product of the
              expected speed and ``filestore max sync interval``.


See `Journal Config Reference`_ for additional details.


Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.


Data Placement
==============

See `Pool & PG Config Reference`_ for details.


.. index:: OSD; scrubbing

Scrubbing
=========

In addition to making multiple copies of objects, Ceph ensures data integrity by
scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
object storage layer. For each placement group, Ceph generates a catalog of all
objects and compares each primary object and its replicas to ensure that no
objects are missing or mismatched. Light scrubbing (daily) checks the object
size and attributes.
Deep scrubbing (weekly) reads the data and uses checksums
to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce
performance. You can adjust the following settings to increase or decrease
scrubbing operations.


``osd max scrubs``

:Description: The maximum number of simultaneous scrub operations for
              a Ceph OSD Daemon.

:Type: 32-bit Int
:Default: ``1``


``osd scrub begin hour``

:Description: The time of day for the lower bound when a scheduled scrub can be
              performed.
:Type: Integer in the range of 0 to 24
:Default: ``0``


``osd scrub end hour``

:Description: The time of day for the upper bound when a scheduled scrub can be
              performed. Along with ``osd scrub begin hour``, this defines a
              time window in which scrubs can happen. However, a scrub will be
              performed whether or not the time window allows it, as long as the
              placement group's scrub interval exceeds ``osd scrub max interval``.
:Type: Integer in the range of 0 to 24
:Default: ``24``


``osd scrub during recovery``

:Description: Allow scrubs during recovery. Setting this to ``false`` will
              disable scheduling new scrubs (and deep scrubs) while there is
              active recovery. Already running scrubs will be continued. This
              might be useful to reduce load on busy clusters.
:Type: Boolean
:Default: ``true``


``osd scrub thread timeout``

:Description: The maximum time in seconds before timing out a scrub thread.
:Type: 32-bit Integer
:Default: ``60``


``osd scrub finalize thread timeout``

:Description: The maximum time in seconds before timing out a scrub finalize
              thread.

:Type: 32-bit Integer
:Default: ``60*10``


``osd scrub load threshold``

:Description: The maximum load. Ceph will not scrub when the system load
              (as defined by ``getloadavg()``) is higher than this number.

:Type: Float
:Default: ``0.5``


``osd scrub min interval``

:Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon
              when the Ceph Storage Cluster load is low.

:Type: Float
:Default: Once per day. ``60*60*24``


``osd scrub max interval``

:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon
              irrespective of cluster load.

:Type: Float
:Default: Once per week. ``7*60*60*24``


``osd scrub chunk min``

:Description: The minimal number of object store chunks to scrub during a single
              operation. Ceph blocks writes to a single chunk during a scrub.

:Type: 32-bit Integer
:Default: 5


``osd scrub chunk max``

:Description: The maximum number of object store chunks to scrub during a single
              operation.

:Type: 32-bit Integer
:Default: 25


``osd scrub sleep``

:Description: Time to sleep before scrubbing the next group of chunks.
              Increasing this value will slow down the whole scrub operation,
              while client operations will be less impacted.

:Type: Float
:Default: 0


``osd deep scrub interval``

:Description: The interval for "deep" scrubbing (fully reading all data). The
              ``osd scrub load threshold`` does not affect this setting.

:Type: Float
:Default: Once per week. ``60*60*24*7``


``osd scrub interval randomize ratio``

:Description: Add a random delay to ``osd scrub min interval`` when scheduling
              the next scrub job for a placement group. The delay is a random
              value less than ``osd scrub min interval`` \*
              ``osd scrub interval randomize ratio``. So the default setting
              effectively spreads the scrubs randomly over the allowed time
              window of ``[1, 1.5]`` \* ``osd scrub min interval``.
:Type: Float
:Default: ``0.5``


``osd deep scrub stride``

:Description: Read size when doing a deep scrub.
:Type: 32-bit Integer
:Default: 512 KB. ``524288``
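
As an illustration only (the values below are hypothetical, not recommendations,
and should be adapted to your workload), several of the options above could be
combined in ``ceph.conf`` to confine scrubs to quiet hours and soften their
impact on client I/O:

.. code-block:: ini

    [osd]
        # Only start scheduled scrubs between 01:00 and 06:00.
        osd scrub begin hour = 1
        osd scrub end hour = 6
        # Do not schedule new scrubs while recovery is active.
        osd scrub during recovery = false
        # Skip scrubbing when the load average exceeds this threshold.
        osd scrub load threshold = 0.5
        # Pause briefly between chunks so client operations are less impacted.
        osd scrub sleep = 0.1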
.. index:: OSD; operations settings

Operations
==========

Operations settings allow you to configure the number of threads for servicing
requests. Setting ``osd op threads`` to ``0`` disables multi-threading.
By default, Ceph uses two threads with a 30 second timeout and a 30 second
complaint time if an operation doesn't complete within those time parameters.
You can set operations priority weights between client operations and
recovery operations to ensure optimal performance during recovery.


``osd op threads``

:Description: The number of threads to service Ceph OSD Daemon operations.
              Set to ``0`` to disable it. Increasing the number may increase
              the request processing rate.

:Type: 32-bit Integer
:Default: ``2``


``osd op queue``

:Description: This sets the type of queue to be used for prioritizing ops
              in the OSDs. All of these queues feature a strict sub-queue which
              is dequeued before the normal queue. The normal queue differs
              between implementations. The original PrioritizedQueue (``prio``)
              uses a token bucket system: when there are sufficient tokens, it
              dequeues high priority queues first; if there are not enough
              tokens available, queues are dequeued from low priority to high
              priority. The WeightedPriorityQueue (``wpq``) dequeues all
              priorities in relation to their priorities to prevent starvation
              of any queue. WPQ should help in cases where a few OSDs are more
              overloaded than others. The new mClock based OpClassQueue
              (``mclock_opclass``) prioritizes operations based on which class
              they belong to (recovery, scrub, snaptrim, client op, osd subop).
              And the mClock based ClientQueue (``mclock_client``) also
              incorporates the client identifier in order to promote fairness
              between clients. See `QoS Based on mClock`_. Requires a restart.

:Type: String
:Valid Choices: prio, wpq, mclock_opclass, mclock_client
:Default: ``prio``


``osd op queue cut off``

:Description: This selects which priority ops will be sent to the strict
              queue versus the normal queue. The ``low`` setting sends all
              replication ops and higher to the strict queue, while the ``high``
              option sends only replication acknowledgement ops and higher to
              the strict queue. Setting this to ``high`` should help when a few
              OSDs in the cluster are very busy, especially when combined with
              ``wpq`` in the ``osd op queue`` setting. Without these settings,
              OSDs that are very busy handling replication traffic could starve
              primary client traffic on those OSDs. Requires a restart.

:Type: String
:Valid Choices: low, high
:Default: ``low``


``osd client op priority``

:Description: The priority set for client operations. It is relative to
              ``osd recovery op priority``.

:Type: 32-bit Integer
:Default: ``63``
:Valid Range: 1-63


``osd recovery op priority``

:Description: The priority set for recovery operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``3``
:Valid Range: 1-63


``osd scrub priority``

:Description: The priority set for scrub operations. It is relative to
              ``osd client op priority``.
:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63


``osd snap trim priority``

:Description: The priority set for snap trim operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63


``osd op thread timeout``

:Description: The Ceph OSD Daemon operation thread timeout in seconds.
:Type: 32-bit Integer
:Default: ``15``


``osd op complaint time``

:Description: An operation becomes complaint-worthy after the specified number
              of seconds has elapsed.

:Type: Float
:Default: ``30``


``osd disk threads``

:Description: The number of disk threads, which are used to perform background
              disk-intensive OSD operations such as scrubbing and snap
              trimming.

:Type: 32-bit Integer
:Default: ``1``


``osd disk thread ioprio class``

:Description: Warning: it will only be used if both ``osd disk thread
              ioprio class`` and ``osd disk thread ioprio priority`` are
              set to a non-default value. Sets the ioprio_set(2) I/O
              scheduling ``class`` for the disk thread. Acceptable
              values are ``idle``, ``be`` or ``rt``. The ``idle``
              class means the disk thread will have lower priority
              than any other thread in the OSD. This is useful to slow
              down scrubbing on an OSD that is busy handling client
              operations. ``be`` is the default and is the same
              priority as all other threads in the OSD. ``rt`` means
              the disk thread will have precedence over all other
              threads in the OSD. Note: only works with the Linux Kernel
              CFQ scheduler. Since Jewel, scrubbing is no longer carried
              out by the disk iothread; see the osd priority options instead.
:Type: String
:Default: the empty string


``osd disk thread ioprio priority``

:Description: Warning: it will only be used if both ``osd disk thread
              ioprio class`` and ``osd disk thread ioprio priority`` are
              set to a non-default value. It sets the ioprio_set(2)
              I/O scheduling ``priority`` of the disk thread, ranging
              from 0 (highest) to 7 (lowest). If all OSDs on a given
              host were in class ``idle`` and compete for I/O
              (e.g., due to controller congestion), it can be used to
              lower the disk thread priority of one OSD to 7 so that
              another OSD with priority 0 can take precedence.
              Note: only works with the Linux Kernel CFQ scheduler.
:Type: Integer in the range of 0 to 7, or -1 if not to be used.
:Default: ``-1``


``osd op history size``

:Description: The maximum number of completed operations to track.
:Type: 32-bit Unsigned Integer
:Default: ``20``


``osd op history duration``

:Description: The oldest completed operation to track.
:Type: 32-bit Unsigned Integer
:Default: ``600``


``osd op log threshold``

:Description: How many operation logs to display at once.
:Type: 32-bit Integer
:Default: ``5``
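
As a sketch of how the queueing options above can be combined (illustrative
only; as noted above, both settings require an OSD restart), an operator who
observes a few very busy OSDs starving primary client traffic might try:

.. code-block:: ini

    [osd]
        # Use the weighted priority queue so no priority class is starved.
        osd op queue = wpq
        # Send only replication acknowledgements and higher to the strict queue.
        osd op queue cut off = high
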
QoS Based on mClock
-------------------

Ceph's use of mClock is currently in the experimental phase and should
be approached with an exploratory mindset.

Core Concepts
`````````````

The QoS support of Ceph is implemented using a queueing scheduler
based on `the dmClock algorithm`_. This algorithm allocates the I/O
resources of the Ceph cluster in proportion to weights, and enforces
the constraints of minimum reservation and maximum limitation, so that
the services can compete for the resources fairly. Currently the
*mclock_opclass* operation queue divides Ceph services involving I/O
resources into the following buckets:

- client op: the IOPS issued by clients
- osd subop: the IOPS issued by the primary OSD
- snap trim: the snap trimming related requests
- pg recovery: the recovery related requests
- pg scrub: the scrub related requests

The resources are partitioned using the following three sets of tags. In other
words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity if extra capacity is available
   or the system is oversubscribed.

In Ceph, operations are graded with a "cost", and the resources allocated
for serving the various services are consumed by these "costs". So, for
example, the more reservation a service has, the more resource it is
guaranteed to possess, as long as it requires it. Assuming there are 2
services: recovery and client ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that recovery won't get more than 5
requests per second serviced, even if it requires them (see the CURRENT
IMPLEMENTATION NOTE below), and no other services are competing with
it. But if the clients start to issue a large amount of I/O requests,
they will not exhaust all the I/O resources either. One request per second
is always allocated for recovery jobs as long as there are any such
requests, so the recovery jobs won't be starved even in a cluster with
high load. In the meantime, the client ops can enjoy a larger
portion of the I/O resources, because their weight is "9" while their
competitor's is "1". Client ops are not clamped by the limit setting,
so they can make use of all the resources if there is no recovery ongoing.

Along with *mclock_opclass* another mclock operation queue named
*mclock_client* is available. It divides operations based on category
but also divides them based on the client making the request. This
helps not only manage the distribution of resources spent on different
classes of operations but also tries to ensure fairness among clients.

CURRENT IMPLEMENTATION NOTE: the current experimental implementation
does not enforce the limit values. As a first approximation we decided
not to prevent operations that would otherwise enter the operation
sequencer from doing so.

Subtleties of mClock
````````````````````

The reservation and limit values have a unit of requests per
second. The weight, however, does not technically have a unit and the
weights are relative to one another. So if one class of requests has a
weight of 1 and another a weight of 9, then the latter class of requests
should have its requests executed at a 9 to 1 ratio relative to the first
class. However, that will only happen once the reservations are met, and
those values include the operations executed under the reservation phase.

Even though the weights do not have units, one must be careful in
choosing their values due to how the algorithm assigns weight tags to
requests. If the weight is *W*, then for a given class of requests,
the next one that comes in will have a weight tag of *1/W* plus the
previous weight tag or the current time, whichever is larger. That
means if *W* is sufficiently large and therefore *1/W* is sufficiently
small, the calculated tag may never be assigned as it will get a value
of the current time. The ultimate lesson is that values for weight
should not be too large. They should be under the number of requests
one expects to be serviced each second.
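
To connect the recovery/client-ops example above to actual configuration, the
three tags map onto the ``osd op queue mclock ...`` options listed later in
this section. The sketch below simply restates that example; the values are
illustrative, not recommendations:

.. code-block:: ini

    [osd]
        osd op queue = mclock_opclass
        # recovery: (r:1, l:5, w:1)
        osd op queue mclock recov res = 1.0
        osd op queue mclock recov lim = 5.0
        osd op queue mclock recov wgt = 1.0
        # client ops: (r:2, l:0, w:9); a limit of 0 leaves client ops unclamped
        osd op queue mclock client op res = 2.0
        osd op queue mclock client op lim = 0.0
        osd op queue mclock client op wgt = 9.0
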
Caveats
```````

There are some factors that can reduce the impact of the mClock op
queues within Ceph. First, requests to an OSD are sharded by their
placement group identifier. Each shard has its own mClock queue and
these queues neither interact nor share information among them. The
number of shards can be controlled with the configuration options
``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
``osd_op_num_shards_ssd``. A lower number of shards will increase the
impact of the mClock queues, but may have other deleterious effects.

Second, requests are transferred from the operation queue to the
operation sequencer, in which they go through the phases of
execution. The operation queue is where mClock resides and mClock
determines the next op to transfer to the operation sequencer. The
number of operations allowed in the operation sequencer is a complex
issue. In general we want to keep enough operations in the sequencer
so it's always getting work done on some operations while it's waiting
for disk and network access to complete on other operations. On the
other hand, once an operation is transferred to the operation
sequencer, mClock no longer has control over it. Therefore to maximize
the impact of mClock, we want to keep as few operations in the
operation sequencer as possible. So we have an inherent tension.

The configuration options that influence the number of operations in
the operation sequencer are ``bluestore_throttle_bytes``,
``bluestore_throttle_deferred_bytes``,
``bluestore_throttle_cost_per_io``,
``bluestore_throttle_cost_per_io_hdd``, and
``bluestore_throttle_cost_per_io_ssd``.

A third factor that affects the impact of the mClock algorithm is that
we're using a distributed system, where requests are made to multiple
OSDs and each OSD has (can have) multiple shards. Yet we're currently
using the mClock algorithm, which is not distributed (note: dmClock is
the distributed version of mClock).

Various organizations and individuals are currently experimenting with
mClock as it exists in this code base along with their modifications
to the code base. We hope you'll share your experiences with your
mClock and dmClock experiments on the ceph-devel mailing list.


``osd push per object cost``

:Description: the overhead for serving a push op

:Type: Unsigned Integer
:Default: 1000


``osd recovery max chunk``

:Description: the maximum total size of data chunks a recovery op can carry.

:Type: Unsigned Integer
:Default: 8 MiB


``osd op queue mclock client op res``

:Description: the reservation of client op.

:Type: Float
:Default: 1000.0


``osd op queue mclock client op wgt``

:Description: the weight of client op.

:Type: Float
:Default: 500.0


``osd op queue mclock client op lim``

:Description: the limit of client op.

:Type: Float
:Default: 1000.0


``osd op queue mclock osd subop res``

:Description: the reservation of osd subop.

:Type: Float
:Default: 1000.0


``osd op queue mclock osd subop wgt``

:Description: the weight of osd subop.

:Type: Float
:Default: 500.0


``osd op queue mclock osd subop lim``

:Description: the limit of osd subop.

:Type: Float
:Default: 0.0


``osd op queue mclock snap res``

:Description: the reservation of snap trimming.
:Type: Float
:Default: 0.0


``osd op queue mclock snap wgt``

:Description: the weight of snap trimming.

:Type: Float
:Default: 1.0


``osd op queue mclock snap lim``

:Description: the limit of snap trimming.

:Type: Float
:Default: 0.001


``osd op queue mclock recov res``

:Description: the reservation of recovery.

:Type: Float
:Default: 0.0


``osd op queue mclock recov wgt``

:Description: the weight of recovery.

:Type: Float
:Default: 1.0


``osd op queue mclock recov lim``

:Description: the limit of recovery.

:Type: Float
:Default: 0.001


``osd op queue mclock scrub res``

:Description: the reservation of scrub jobs.

:Type: Float
:Default: 0.0


``osd op queue mclock scrub wgt``

:Description: the weight of scrub jobs.

:Type: Float
:Default: 1.0


``osd op queue mclock scrub lim``

:Description: the limit of scrub jobs.

:Type: Float
:Default: 0.001

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf


.. index:: OSD; backfilling

Backfilling
===========

When you add Ceph OSD Daemons to a cluster or remove them from it, the CRUSH
algorithm will want to rebalance the cluster by moving placement groups to or
from Ceph OSD Daemons to restore the balance. The process of migrating placement
groups and the objects they contain can reduce the cluster's operational
performance considerably. To maintain operational performance, Ceph performs
this migration with 'backfilling', which allows Ceph to set backfill operations
to a lower priority than requests to read or write data.


``osd max backfills``

:Description: The maximum number of backfills allowed to or from a single OSD.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd backfill scan min``

:Description: The minimum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``64``


``osd backfill scan max``

:Description: The maximum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``512``


``osd backfill retry interval``

:Description: The number of seconds to wait before retrying backfill requests.
:Type: Double
:Default: ``10.0``

.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.


``osd map dedup``

:Description: Enable removing duplicates in the OSD map.
:Type: Boolean
:Default: ``true``


``osd map cache size``

:Description: The number of OSD maps to keep cached.
:Type: 32-bit Integer
:Default: ``500``


``osd map cache bl size``

:Description: The size of the in-memory OSD map cache in OSD daemons.
:Type: 32-bit Integer
:Default: ``50``


``osd map cache bl inc size``

:Description: The size of the in-memory OSD map cache incrementals in
              OSD daemons.

:Type: 32-bit Integer
:Default: ``100``


``osd map message max``

:Description: The maximum map entries allowed per MOSDMap message.
:Type: 32-bit Integer
:Default: ``100``


.. index:: OSD; recovery

Recovery
========

When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
sync with other Ceph OSD Daemons containing more recent versions of objects in
the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
mode and seeks to get the latest copy of the data and bring its map back up to
date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
and placement groups may be significantly out of date. Also, if a failure domain
went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
the same time. This can make the recovery process time consuming and resource
intensive.

To maintain operational performance, Ceph performs recovery with limitations on
the number of recovery requests, threads and object chunk sizes, which allows
Ceph to perform well in a degraded state.
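
As a hedged illustration of these limits (the values below are examples only,
not recommendations), an operator who wants recovery and backfill to yield
further to client I/O might tighten the options described in this and the
previous sections:

.. code-block:: ini

    [osd]
        # Allow at most one concurrent backfill to or from each OSD (the default).
        osd max backfills = 1
        # Fewer simultaneous recovery requests per OSD than the default of 3.
        osd recovery max active = 1
        # Keep recovery ops at a low priority relative to client ops.
        osd recovery op priority = 1
        # Sleep between recovery/backfill ops on HDD-backed OSDs.
        osd recovery sleep hdd = 0.2
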
``osd recovery delay start``

:Description: After peering completes, Ceph will delay for the specified number
              of seconds before starting to recover objects.

:Type: Float
:Default: ``0``


``osd recovery max active``

:Description: The number of active recovery requests per OSD at one time. More
              requests will accelerate recovery, but the requests place an
              increased load on the cluster.

:Type: 32-bit Integer
:Default: ``3``


``osd recovery max chunk``

:Description: The maximum size of a recovered chunk of data to push.
:Type: 64-bit Unsigned Integer
:Default: ``8 << 20``


``osd recovery max single start``

:Description: The maximum number of recovery operations per OSD that will be
              newly started when an OSD is recovering.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd recovery thread timeout``

:Description: The maximum time in seconds before timing out a recovery thread.
:Type: 32-bit Integer
:Default: ``30``


``osd recover clone overlap``

:Description: Preserves clone overlap during recovery. Should always be set
              to ``true``.

:Type: Boolean
:Default: ``true``


``osd recovery sleep``

:Description: Time in seconds to sleep before the next recovery or backfill op.
              Increasing this value will slow down the recovery operation, while
              client operations will be less impacted.

:Type: Float
:Default: ``0``


``osd recovery sleep hdd``

:Description: Time in seconds to sleep before the next recovery or backfill op
              for HDDs.

:Type: Float
:Default: ``0.1``


``osd recovery sleep ssd``

:Description: Time in seconds to sleep before the next recovery or backfill op
              for SSDs.

:Type: Float
:Default: ``0``


``osd recovery sleep hybrid``

:Description: Time in seconds to sleep before the next recovery or backfill op
              when osd data is on HDD and osd journal is on SSD.

:Type: Float
:Default: ``0.025``


Tiering
=======

``osd agent max ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the high speed mode.
:Type: 32-bit Integer
:Default: ``4``


``osd agent max low ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the low speed mode.
:Type: 32-bit Integer
:Default: ``2``

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.


Miscellaneous
=============


``osd snap trim thread timeout``

:Description: The maximum time in seconds before timing out a snap trim thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd backlog thread timeout``

:Description: The maximum time in seconds before timing out a backlog thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd default notify timeout``

:Description: The OSD default notification timeout (in seconds).
:Type: 32-bit Unsigned Integer
:Default: ``30``


``osd check for log corruption``

:Description: Check log files for corruption. Can be computationally expensive.
:Type: Boolean
:Default: ``false``


``osd remove thread timeout``

:Description: The maximum time in seconds before timing out a remove OSD thread.
:Type: 32-bit Integer
:Default: ``60*60``


``osd command thread timeout``

:Description: The maximum time in seconds before timing out a command thread.
:Type: 32-bit Integer
:Default: ``10*60``


``osd command max records``

:Description: Limits the number of lost objects to return.
:Type: 32-bit Integer
:Default: ``256``


``osd auto upgrade tmap``

:Description: Uses ``tmap`` for ``omap`` on old objects.
:Type: Boolean
:Default: ``true``


``osd tmapput sets users tmap``

:Description: Uses ``tmap`` for debugging only.
:Type: Boolean
:Default: ``false``


``osd fast fail on connection refused``

:Description: If this option is enabled, crashed OSDs are marked down
              immediately by connected peers and MONs (assuming that the
              crashed OSD host survives). Disable it to restore old
              behavior, at the expense of possible long I/O stalls when
              OSDs crash in the middle of I/O operations.
:Type: Boolean
:Default: ``true``



.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio