src/ceph/doc/rados/operations/crush-map-edits.rst

   1 Manually editing a CRUSH Map
   2 ============================
   3
   4 .. note:: Manually editing the CRUSH map is considered an advanced
   5           administrator operation.  All CRUSH changes that are
   6           necessary for the overwhelming majority of installations are
   7           possible via the standard ceph CLI and do not require manual
   8           CRUSH map edits.  If you have identified a use case where
   9           manual edits *are* necessary, consider contacting the Ceph
  10           developers so that future versions of Ceph can make this
  11           unnecessary.
  12
  13 To edit an existing CRUSH map:
  14
  15 #. `Get the CRUSH map`_.
  16 #. `Decompile`_ the CRUSH map.
  17 #. Edit at least one of `Devices`_, `Buckets`_ and `Rules`_.
  18 #. `Recompile`_ the CRUSH map.
  19 #. `Set the CRUSH map`_.
  20
  21 To activate CRUSH map rules for a specific pool, identify the common ruleset
  22 number for those rules and specify that ruleset number for the pool. See `Set
  23 Pool Values`_ for details.
  24
  25 .. _Get the CRUSH map: #getcrushmap
  26 .. _Decompile: #decompilecrushmap
  27 .. _Devices: #crushmapdevices
  28 .. _Buckets: #crushmapbuckets
  29 .. _Rules: #crushmaprules
  30 .. _Recompile: #compilecrushmap
  31 .. _Set the CRUSH map: #setcrushmap
  32 .. _Set Pool Values: ../pools#setpoolvalues
  33
  34 .. _getcrushmap:
  35
  36 Get a CRUSH Map
  37 ---------------
  38
  39 To get the CRUSH map for your cluster, execute the following::
  40
  41         ceph osd getcrushmap -o {compiled-crushmap-filename}
  42
  43 Ceph will output (-o) a compiled CRUSH map to the filename you specified. Since
  44 the CRUSH map is in a compiled form, you must decompile it first before you can
  45 edit it.
  46
  47 .. _decompilecrushmap:
  48
  49 Decompile a CRUSH Map
  50 ---------------------
  51
  52 To decompile a CRUSH map, execute the following::
  53
  54         crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}
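
The full edit cycle (get, decompile, edit, recompile, set) can be sketched as a
small Python helper that only builds the command lines. This is a minimal
sketch, not an official tool: the function name ``crush_edit_commands`` and the
file names are hypothetical, and it assumes ``crushtool -c`` is used to
recompile and ``ceph osd setcrushmap -i`` to inject the result.

```python
def crush_edit_commands(compiled="crush.bin", decompiled="crush.txt",
                        recompiled="crush.new.bin"):
    """Return the argv lists for the get/decompile/recompile/set cycle.

    All file names are hypothetical placeholders. Nothing is executed here.
    """
    return [
        # 1. Fetch the compiled CRUSH map from the cluster.
        ["ceph", "osd", "getcrushmap", "-o", compiled],
        # 2. Decompile it into editable text.
        ["crushtool", "-d", compiled, "-o", decompiled],
        # (edit the decompiled file by hand between steps 2 and 3)
        # 3. Recompile the edited text.
        ["crushtool", "-c", decompiled, "-o", recompiled],
        # 4. Inject the recompiled map into the cluster.
        ["ceph", "osd", "setcrushmap", "-i", recompiled],
    ]


commands = crush_edit_commands()
for argv in commands:
    print(" ".join(argv))
```

On a host with admin credentials, each list could be passed to
``subprocess.run(argv, check=True)``, pausing after the second command to edit
the decompiled file.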
  55
  56
  57 Sections
  58 --------
  59
  60 There are six main sections to a CRUSH Map.
  61
  62 #. **tunables:** The preamble at the top of the map describes any *tunables*
  63    for CRUSH behavior that vary from the historical/legacy CRUSH behavior. These
  64    tunables correct old bugs and enable optimizations or other changes that
  65    have been made over the years to improve CRUSH's behavior.
  66
  67 #. **devices:** Devices are individual ``ceph-osd`` daemons that can
  68    store data.
  69
  70 #. **types:** Bucket ``types`` define the types of buckets used in
  71    your CRUSH hierarchy. Buckets consist of a hierarchical aggregation
  72    of storage locations (e.g., rows, racks, chassis, hosts, etc.) and
  73    their assigned weights.
  74
  75 #. **buckets:** Once you define bucket types, you must define each node
  76    in the hierarchy, its type, and which devices or other nodes it
  77    contains.
  78
  79 #. **rules:** Rules define policy about how data is distributed across
  80    devices in the hierarchy.
  81
  82 #. **choose_args:** Choose_args are alternative weights associated with
  83    the hierarchy that have been adjusted to optimize data placement.  A single
  84    choose_args map can be used for the entire cluster, or one can be
  85    created for each individual pool.
  86
  87
  88 .. _crushmapdevices:
  89
  90 CRUSH Map Devices
  91 -----------------
  92
  93 Devices are individual ``ceph-osd`` daemons that can store data.  You
  94 will normally have one defined here for each OSD daemon in your
  95 cluster.  Devices are identified by an id (a non-negative integer) and
  96 a name, normally ``osd.N`` where ``N`` is the device id.
  97
  98 Devices may also have a *device class* associated with them (e.g.,
  99 ``hdd`` or ``ssd``), allowing them to be conveniently targeted by a
 100 CRUSH rule.
 101
 102 ::
 103
 104         # devices
 105         device {num} {osd.name} [class {class}]
 106
 107 For example::
 108
 109         # devices
 110         device 0 osd.0 class ssd
 111         device 1 osd.1 class hdd
 112         device 2 osd.2
 113         device 3 osd.3
 114
 115 In most cases, each device maps to a single ``ceph-osd`` daemon.  This
 116 is normally a single storage device, a pair of devices (for example,
 117 one for data and one for a journal or metadata), or in some cases a
 118 small RAID device.
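
To check which devices carry a device class before writing a rule against one,
the device lines of a decompiled map can be parsed with a few lines of Python.
This is a minimal sketch under the syntax shown above; the helper name
``parse_devices`` and the regular expression are illustrative assumptions, not
part of Ceph.

```python
import re

# Hypothetical helper: matches "device {num} {osd.name} [class {class}]".
DEVICE_RE = re.compile(r"^device\s+(\d+)\s+(\S+)(?:\s+class\s+(\S+))?$")


def parse_devices(text):
    """Return (id, name, device_class) tuples; device_class is None if absent."""
    devices = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments such as "# devices".
        if not line or line.startswith("#"):
            continue
        match = DEVICE_RE.match(line)
        if match:
            num, name, dev_class = match.groups()
            devices.append((int(num), name, dev_class))
    return devices


example = """
# devices
device 0 osd.0 class ssd
device 1 osd.1 class hdd
device 2 osd.2
device 3 osd.3
"""
devices = parse_devices(example)
for dev in devices:
    print(dev)
```

Devices without a class come back with ``None`` in the third position, which
makes it easy to spot OSDs that a class-restricted rule would exclude.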
 119
 120
 121
 122
 123
 124 CRUSH Map Bucket Types
 125 ----------------------
 126
 127 The second list in the CRUSH map defines 'bucket' types. Buckets facilitate
 128 a hierarchy of nodes and leaves. Node (or non-leaf) buckets typically represent
 129 physical locations in a hierarchy. Nodes aggregate other nodes or leaves.
 130 Leaf buckets represent ``ceph-osd`` daemons and their corresponding storage
 131 media.
 132
 133 .. tip:: The term "bucket" used in the context of CRUSH means a node in
 134    the hierarchy, i.e. a location or a piece of physical hardware. It
 135    is a different concept from the term "bucket" when used in the
 136    context of RADOS Gateway APIs.
 137
 138 To add a bucket type to the CRUSH map, create a new line under your list of
 139 bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name.
 140 By convention, there is one leaf bucket and it is ``type 0``; however, you may
 141 give it any name you like (e.g., osd, disk, drive, storage, etc.)::
 142
 143         # types
 144         type {num} {bucket-name}
 145
 146 For example::
 147
 148         # types
 149         type 0 osd
 150         type 1 host
 151         type 2 chassis
 152         type 3 rack
 153         type 4 row
 154         type 5 pdu
 155         type 6 pod
 156         type 7 room
 157         type 8 datacenter
 158         type 9 region
 159         type 10 root
 160
 161
 162
 163 .. _crushmapbuckets:
 164
 165 CRUSH Map Bucket Hierarchy
 166 --------------------------
 167
 168 The CRUSH algorithm distributes data objects among storage devices according
 169 to a per-device weight value, approximating a uniform probability distribution.
 170 CRUSH distributes objects and their replicas according to the hierarchical
 171 cluster map you define. Your CRUSH map represents the available storage
 172 devices and the logical elements that contain them.
 173
 174 To map placement groups to OSDs across failure domains, a CRUSH map defines a
 175 hierarchical list of bucket types (i.e., under ``# types`` in the generated CRUSH
 176 map). The purpose of creating a bucket hierarchy is to segregate the
 177 leaf nodes by their failure domains, such as hosts, chassis, racks, power
 178 distribution units, pods, rows, rooms, and data centers. With the exception of
 179 the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and
 180 you may define it according to your own needs.
 181
 182 We recommend adapting your CRUSH map to your firm's hardware naming conventions
 183 and using instance names that reflect the physical hardware. Your naming
 184 practice can make it easier to administer the cluster and troubleshoot
 185 problems when an OSD and/or other hardware malfunctions and the administrator
 186 needs access to physical hardware.
 187
 188 In the following example, the bucket hierarchy has a leaf bucket named ``osd``,
 189 and two node buckets named ``host`` and ``rack`` respectively.
 190
 191 .. ditaa::
 192                            +-----------+
 193                            | {o}rack   |
 194                            |   Bucket  |
 195                            +-----+-----+
 196                                  |
 197                  +---------------+---------------+
 198                  |                               |
 199            +-----+-----+                   +-----+-----+
 200            | {o}host   |                   | {o}host   |
 201            |   Bucket  |                   |   Bucket  |
 202            +-----+-----+                   +-----+-----+
 203                  |                               |
 204          +-------+-------+               +-------+-------+
 205          |               |               |               |
 206    +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
 207    |    osd    |   |    osd    |   |    osd    |   |    osd    |
 208    |   Bucket  |   |   Bucket  |   |   Bucket  |   |   Bucket  |
 209    +-----------+   +-----------+   +-----------+   +-----------+
 210
 211 .. note:: The higher numbered ``rack`` bucket type aggregates the lower
 212    numbered ``host`` bucket type.
 213
 214 Since leaf nodes reflect storage devices declared under the ``# devices`` list
 215 at the beginning of the CRUSH map, you do not need to declare them as bucket
 216 instances. The second lowest bucket type in your hierarchy usually aggregates
 217 the devices (i.e., it's usually the computer containing the storage media, and
 218 uses whatever term you prefer to describe it, such as "node", "computer",
 219 "server", "host", "machine", etc.). In high density environments, it is
 220 increasingly common to see multiple hosts/nodes per chassis. You should account
 221 for chassis failure too--e.g., the need to pull a chassis if a node fails may
 222 result in bringing down numerous hosts/nodes and their OSDs.
 223
 224 When declaring a bucket instance, you must specify its type, give it a unique
 225 name (string), assign it a unique ID expressed as a negative integer (optional),
 226 specify a weight relative to the total capacity/capability of its item(s),
 227 specify the bucket algorithm (usually ``straw``), and the hash (usually ``0``,
 228 reflecting hash algorithm ``rjenkins1``). A bucket may have one or more items.
 229 The items may consist of node buckets or leaves. Items may have a weight that
 230 reflects the relative weight of the item.
 231
 232 You may declare a node bucket with the following syntax::
 233
 234         [bucket-type] [bucket-name] {
 235                 id [a unique negative numeric ID]
 236                 weight [the relative capacity/capability of the item(s)]
 237                 alg [the bucket type: uniform | list | tree | straw ]
 238                 hash [the hash type: 0 by default]
 239                 item [item-name] weight [weight]
 240         }
 241
 242 For example, using the diagram above, we would define two host buckets
 243 and one rack bucket. The OSDs are declared as items within the host buckets::
 244
 245         host node1 {
 246                 id -1
 247                 alg straw
 248                 hash 0
 249                 item osd.0 weight 1.00
 250                 item osd.1 weight 1.00
 251         }
 252
 253         host node2 {
 254                 id -2
 255                 alg straw
 256                 hash 0
 257                 item osd.2 weight 1.00
 258                 item osd.3 weight 1.00
 259         }
 260
 261         rack rack1 {
 262                 id -3
 263                 alg straw
 264                 hash 0
 265                 item node1 weight 2.00
 266                 item node2 weight 2.00
 267         }
 268
 269 .. note:: In the foregoing example, note that the rack bucket does not contain
 270    any OSDs. Rather it contains lower level host buckets, and includes the
 271    sum total of their weight in the item entry.
 272
 273 .. topic:: Bucket Types
 274
 275    Ceph supports four bucket types, each representing a tradeoff between
 276    performance and reorganization efficiency. If you are unsure of which bucket
 277    type to use, we recommend using a ``straw`` bucket.  For a detailed
 278    discussion of bucket types, refer to
 279    `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_,
 280    and more specifically to **Section 3.4**. The bucket types are:
 281
 282         #. **Uniform:** Uniform buckets aggregate devices with **exactly** the same
 283            weight. For example, when firms commission or decommission hardware, they
 284            typically do so with many machines that have exactly the same physical
 285            configuration (e.g., bulk purchases). When storage devices have exactly
 286            the same weight, you may use the ``uniform`` bucket type, which allows
 287            CRUSH to map replicas into uniform buckets in constant time. With
 288            non-uniform weights, you should use another bucket algorithm.
 289
 290         #. **List**: List buckets aggregate their content as linked lists. Based on
 291            the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`P` algorithm,
 292            a list is a natural and intuitive choice for an **expanding cluster**:
 293            either an object is relocated to the newest device with some appropriate
 294            probability, or it remains on the older devices as before. The result is
 295            optimal data migration when items are added to the bucket. Items removed
 296            from the middle or tail of the list, however, can result in a significant
 297            amount of unnecessary movement, making list buckets most suitable for
 298            circumstances in which they **never (or very rarely) shrink**.
 299
 300         #. **Tree**: Tree buckets use a binary search tree. They are more efficient
 301            than list buckets when a bucket contains a larger set of items. Based on
 302            the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`R` algorithm,
 303            tree buckets reduce the placement time to O(log n), making them
 304            suitable for managing much larger sets of devices or nested buckets.
 305
 306         #. **Straw:** List and Tree buckets use a divide and conquer strategy
 307            in a way that either gives certain items precedence (e.g., those
 308            at the beginning of a list) or obviates the need to consider entire
 309            subtrees of items at all. That improves the performance of the replica
 310            placement process, but can also introduce suboptimal reorganization
 311            behavior when the contents of a bucket change due to an addition, removal,
 312            or re-weighting of an item. The straw bucket type allows all items to
 313            fairly “compete” against each other for replica placement through a
 314            process analogous to a draw of straws.
 315
 316 .. topic:: Hash
 317
 318    Each bucket uses a hash algorithm. Currently, Ceph supports ``rjenkins1``.
 319    Enter ``0`` as your hash setting to select ``rjenkins1``.
 320
 321
 322 .. _weightingbucketitems:
 323
 324 .. topic:: Weighting Bucket Items
 325
 326    Ceph expresses bucket weights as doubles, which allows for fine-grained
 327    weighting. A weight expresses a device's capacity relative to that of
 328    other devices. We recommend using ``1.00`` as the relative weight for a
 329    1TB storage device. In such a scenario, a weight of ``0.5`` would represent
 330    approximately 500GB, and a weight of ``3.00`` would represent approximately
 331    3TB. Higher level buckets have a weight that is the sum total of the leaf
 332    items aggregated by the bucket.
 333
 334    A bucket item weight is one dimensional, but you may also calculate your
 335    item weights to reflect the performance of the storage drive. For example,
 336    if you have many 1TB drives where some have relatively low data transfer
 337    rate and the others have a relatively high data transfer rate, you may
 338    weight them differently, even though they have the same capacity (e.g.,
 339    a weight of 0.80 for the first set of drives with lower total throughput,
 340    and 1.20 for the second set of drives with higher total throughput).
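
The weighting convention described above can be illustrated with a short
calculation. This is a minimal sketch, assuming the ``1.00`` per 1TB
convention with 1TB treated as 1000GB; the function names
``weight_for_capacity`` and ``bucket_weight`` are hypothetical and not part of
any Ceph tool.

```python
def weight_for_capacity(capacity_gb):
    """Map a capacity in GB to a CRUSH weight (assumes 1.00 per 1000 GB)."""
    return round(capacity_gb / 1000.0, 2)


def bucket_weight(item_weights):
    """A higher-level bucket's weight is the sum of the weights of its items."""
    return round(sum(item_weights), 2)


# Two hosts with two 1TB OSDs each, aggregated by one rack.
osd_weight = weight_for_capacity(1000)
host_weight = bucket_weight([osd_weight, osd_weight])
rack_weight = bucket_weight([host_weight, host_weight])
print(osd_weight, host_weight, rack_weight)
```

To fold throughput into the weight, as the second paragraph of the topic
suggests, the result of ``weight_for_capacity`` could be scaled by a
per-drive-model factor of your own choosing.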
 341
 342
 343 .. _crushmaprules:
 344
 345 CRUSH Map Rules
 346 ---------------
 347
 348 CRUSH maps support the notion of 'CRUSH rules', which are the rules that
 349 determine data placement for a pool. For large clusters, you will likely create
 350 many pools where each pool may have its own CRUSH ruleset and rules. The default
 351 CRUSH map has a rule for each pool, and one ruleset assigned to each of the
 352 default pools.
 353
 354 .. note:: In most cases, you will not need to modify the default rules. When
 355    you create a new pool, its default ruleset is ``0``.
 356
 357
 358 CRUSH rules define placement and replication strategies or distribution policies
 359 that allow you to specify exactly how CRUSH places object replicas. For
 360 example, you might create a rule selecting a pair of targets for 2-way
 361 mirroring, another rule for selecting three targets in two different data
 362 centers for 3-way mirroring, and yet another rule for erasure coding over six
 363 storage devices. For a detailed discussion of CRUSH rules, refer to
 364 `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_,
 365 and more specifically to **Section 3.2**.
 366
 367 A rule takes the following form::
 368
 369         rule <rulename> {
 370
 371                 ruleset <ruleset>
 372                 type [ replicated | erasure ]
 373                 min_size <min-size>
 374                 max_size <max-size>
 375                 step take <bucket-name> [class <device-class>]
 376                 step [choose|chooseleaf] [firstn|indep] <N> type <bucket-type>
 377                 step emit
 378         }
 379
 380
 381 ``ruleset``
 382
 383 :Description: A means of classifying a rule as belonging to a set of rules.
 384               Activated by `setting the ruleset in a pool`_.
 385
 386 :Purpose: A component of the rule mask.
 387 :Type: Integer
 388 :Required: Yes
 389 :Default: 0
 390
 391 .. _setting the ruleset in a pool: ../pools#setpoolvalues
 392
 393
 394 ``type``
 395
 396 :Description: Describes whether the rule is for a replicated pool
 397               (``replicated``) or an erasure-coded pool (``erasure``).
 398
 399 :Purpose: A component of the rule mask.
 400 :Type: String
 401 :Required: Yes
 402 :Default: ``replicated``
 403 :Valid Values: Currently only ``replicated`` and ``erasure``
 404
 405 ``min_size``
 406
 407 :Description: If a pool makes fewer replicas than this number, CRUSH will
 408               **NOT** select this rule.
 409
 410 :Type: Integer
 411 :Purpose: A component of the rule mask.
 412 :Required: Yes
 413 :Default: ``1``
 414
 415 ``max_size``
 416
 417 :Description: If a pool makes more replicas than this number, CRUSH will
 418               **NOT** select this rule.
 419
 420 :Type: Integer
 421 :Purpose: A component of the rule mask.
 422 :Required: Yes
 423 :Default: 10
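
Taken together, ``ruleset``, ``type``, ``min_size`` and ``max_size`` form the
rule mask that decides whether a rule applies to a pool. The selection logic
can be sketched as follows. This is a simplified illustration of the behavior
described above, not Ceph's implementation; the ``RuleMask`` class, the
``find_rule`` function and the sample rules are assumptions made for the
example.

```python
from dataclasses import dataclass


@dataclass
class RuleMask:
    name: str
    ruleset: int
    type: str
    min_size: int
    max_size: int


def find_rule(rules, ruleset, pool_type, pool_size):
    """Return the first rule whose mask matches the pool, or None."""
    for rule in rules:
        if (rule.ruleset == ruleset
                and rule.type == pool_type
                and rule.min_size <= pool_size <= rule.max_size):
            return rule
    return None


# Hypothetical rules for illustration only.
rules = [
    RuleMask("platter", ruleset=3, type="replicated", min_size=1, max_size=10),
    RuleMask("ssd", ruleset=4, type="replicated", min_size=1, max_size=4),
]

matched = find_rule(rules, ruleset=4, pool_type="replicated", pool_size=3)
too_big = find_rule(rules, ruleset=4, pool_type="replicated", pool_size=5)
print(matched, too_big)
```

In this sketch a pool with five replicas that points at ruleset ``4`` matches
no rule, because its size exceeds ``max_size``.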
 424
 425
 426 ``step take <bucket-name> [class <device-class>]``
 427
 428 :Description: Takes a bucket name, and begins iterating down the tree.
 429               If the ``device-class`` is specified, it must match
 430               a class previously used when defining a device. All
 431               devices that do not belong to the class are excluded.
 432 :Purpose: A component of the rule.
 433 :Required: Yes
 434 :Example: ``step take data``
 435
 436
 437 ``step choose firstn {num} type {bucket-type}``
 438
 439 :Description: Selects the given number of buckets of the given type. The number
 440               is usually the number of replicas in the pool (i.e., pool size).
 441
 442               - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available).
 443               - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
 444               - If ``{num} < 0``, it means ``pool-num-replicas - |{num}|``.
 445
 446 :Purpose: A component of the rule.
 447 :Prerequisite: Follows ``step take`` or ``step choose``.
 448 :Example: ``step choose firstn 1 type row``
 449
 450
 451 ``step chooseleaf firstn {num} type {bucket-type}``
 452
 453 :Description: Selects a set of buckets of ``{bucket-type}`` and chooses a leaf
 454               node from the subtree of each bucket in the set of buckets. The
 455               number of buckets in the set is usually the number of replicas in
 456               the pool (i.e., pool size).
 457
 458               - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available).
 459               - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
 460               - If ``{num} < 0``, it means ``pool-num-replicas - |{num}|``.
 461
 462 :Purpose: A component of the rule. Usage removes the need to select a device using two steps.
 463 :Prerequisite: Follows ``step take`` or ``step choose``.
 464 :Example: ``step chooseleaf firstn 0 type row``
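
How ``{num}`` in a ``choose`` or ``chooseleaf`` step translates into a bucket
count can be sketched in a few lines. This is a minimal illustration of the
three cases listed above, where a negative ``{num}`` is counted back from the
pool size; the function name ``resolve_num`` is hypothetical.

```python
def resolve_num(num, pool_size):
    """Return how many buckets a step selects for a pool of the given size."""
    if num > 0:
        # A positive value is used as given.
        return num
    # num == 0 selects pool_size buckets; num < 0 selects pool_size - |num|.
    # A result of zero or less means the step selects nothing.
    return max(pool_size + num, 0)


pool_size = 3
counts = {num: resolve_num(num, pool_size) for num in (0, 1, -1)}
print(counts)
```

For a pool of size 3, ``firstn 1`` followed in a second pass by ``firstn -1``
selects one bucket and then the remaining two.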
 465
 466
 467
 468 ``step emit``
 469
 470 :Description: Outputs the current value and empties the stack. Typically used
 471               at the end of a rule, but may also be used to pick from different
 472               trees in the same rule.
 473
 474 :Purpose: A component of the rule.
 475 :Prerequisite: Follows ``step choose``.
 476 :Example: ``step emit``
 477
 478 .. important:: To activate one or more rules with a common ruleset number for a
 479    pool, set the ruleset number of the pool.
 480
 481
 482 Placing Different Pools on Different OSDs
 483 =========================================
 484
 485 Suppose you want to have most pools default to OSDs backed by large hard drives,
 486 but have some pools mapped to OSDs backed by fast solid-state drives (SSDs).
 487 It's possible to have multiple independent CRUSH hierarchies within the same
 488 CRUSH map. Define two hierarchies with two different root nodes--one for hard
 489 disks (e.g., "root platter") and one for SSDs (e.g., "root ssd") as shown
 490 below::
 491
 492         device 0 osd.0
 493         device 1 osd.1
 494         device 2 osd.2
 495         device 3 osd.3
 496         device 4 osd.4
 497         device 5 osd.5
 498         device 6 osd.6
 499         device 7 osd.7
 500
 501         host ceph-osd-ssd-server-1 {
 502                 id -1
 503                 alg straw
 504                 hash 0
 505                 item osd.0 weight 1.00
 506                 item osd.1 weight 1.00
 507         }
 508
 509         host ceph-osd-ssd-server-2 {
 510                 id -2
 511                 alg straw
 512                 hash 0
 513                 item osd.2 weight 1.00
 514                 item osd.3 weight 1.00
 515         }
 516
 517         host ceph-osd-platter-server-1 {
 518                 id -3
 519                 alg straw
 520                 hash 0
 521                 item osd.4 weight 1.00
 522                 item osd.5 weight 1.00
 523         }
 524
 525         host ceph-osd-platter-server-2 {
 526                 id -4
 527                 alg straw
 528                 hash 0
 529                 item osd.6 weight 1.00
 530                 item osd.7 weight 1.00
 531         }
 532
 533         root platter {
 534                 id -5
 535                 alg straw
 536                 hash 0
 537                 item ceph-osd-platter-server-1 weight 2.00
 538                 item ceph-osd-platter-server-2 weight 2.00
 539         }
 540
 541         root ssd {
 542                 id -6
 543                 alg straw
 544                 hash 0
 545                 item ceph-osd-ssd-server-1 weight 2.00
 546                 item ceph-osd-ssd-server-2 weight 2.00
 547         }
 548
 549         rule data {
 550                 ruleset 0
 551                 type replicated
 552                 min_size 2
 553                 max_size 2
 554                 step take platter
 555                 step chooseleaf firstn 0 type host
 556                 step emit
 557         }
 558
 559         rule metadata {
 560                 ruleset 1
 561                 type replicated
 562                 min_size 0
 563                 max_size 10
 564                 step take platter
 565                 step chooseleaf firstn 0 type host
 566                 step emit
 567         }
 568
 569         rule rbd {
 570                 ruleset 2
 571                 type replicated
 572                 min_size 0
 573                 max_size 10
 574                 step take platter
 575                 step chooseleaf firstn 0 type host
 576                 step emit
 577         }
 578
 579         rule platter {
 580                 ruleset 3
 581                 type replicated
 582                 min_size 0
 583                 max_size 10
 584                 step take platter
 585                 step chooseleaf firstn 0 type host
 586                 step emit
 587         }
 588
 589         rule ssd {
 590                 ruleset 4
 591                 type replicated
 592                 min_size 0
 593                 max_size 4
 594                 step take ssd
 595                 step chooseleaf firstn 0 type host
 596                 step emit
 597         }
 598
 599         rule ssd-primary {
 600                 ruleset 5
 601                 type replicated
 602                 min_size 5
 603                 max_size 10
 604                 step take ssd
 605                 step chooseleaf firstn 1 type host
 606                 step emit
 607                 step take platter
 608                 step chooseleaf firstn -1 type host
 609                 step emit
 610         }
 611
 612 You can then set a pool to use the SSD rule by executing the following::
 613
 614   ceph osd pool set <poolname> crush_ruleset 4
 615
 616 Similarly, using the ``ssd-primary`` rule will cause each placement group in the
 617 pool to be placed with an SSD as the primary and platters as the replicas.
 618
 619
 620 Tuning CRUSH, the hard way
 621 --------------------------
 622
 623 If you can ensure that all clients are running recent code, you can
 624 adjust the tunables by extracting the CRUSH map, modifying the values,
 625 and reinjecting it into the cluster.
 626
 627 * Extract the latest CRUSH map::
 628
 629         ceph osd getcrushmap -o /tmp/crush
 630
 631 * Adjust tunables.  These values appear to offer the best behavior
 632   for both large and small clusters we tested with.  You will need to
 633   additionally specify the ``--enable-unsafe-tunables`` argument to
 634   ``crushtool`` for this to work.  Please use this option with
 635   extreme care::
 636
 637         crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new
 638
 639 * Reinject modified map::
 640
 641         ceph osd setcrushmap -i /tmp/crush.new
 642
 643 Legacy values
 644 -------------
 645
 646 For reference, the legacy values for the CRUSH tunables can be set
 647 with::
 648
 649    crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy
 650
 651 Again, the special ``--enable-unsafe-tunables`` option is required.
 652 Further, as noted above, be careful running old versions of the
 653 ``ceph-osd`` daemon after reverting to legacy values as the feature
 654 bit is not perfectly enforced.