============================
 PG (Placement Group) notes
============================

Miscellaneous copy-pastes from emails; when this gets cleaned up it
should move out of /dev.

Overview
========

PG = "placement group". When placing data in the cluster, objects are
mapped into PGs, and those PGs are mapped onto OSDs. We use this
indirection so that we can group objects, which reduces the amount of
per-object metadata we need to track and the number of processes we
need to run (it would be prohibitively expensive to track e.g. the
placement history on a per-object basis). Increasing the number of PGs
can reduce the variance in per-OSD load across your cluster, but each
PG requires a bit more CPU and memory on the OSDs that are storing it.
We try to ballpark it at about 100 PGs/OSD, although it can vary
widely without ill effects depending on your cluster. (The original
emailer hit a bug in how we calculate the initial PG number from a
cluster description.)

There are a couple of different categories of PGs; the 6 that exist
(in the original emailer's ``ceph -s`` output) are "local" PGs which
are tied to a specific OSD. However, those aren't actually used in a
standard Ceph configuration.


Mapping algorithm (simplified)
==============================

| > What does the Object->PG mapping look like? Do you map more than one
| > object to one PG, or do you sometimes map an object to more than one
| > PG? How about the mapping of PGs to OSDs: does one PG belong to
| > exactly one OSD?
| >
| > Does one PG represent a fixed amount of storage space?

Many objects map to one PG.

Each object maps to exactly one PG.

One PG maps to a single list of OSDs, where the first one in the list
is the primary and the rest are replicas.

Many PGs can map to one OSD.

A PG represents nothing but a grouping of objects; you configure the
number of PGs you want (number of OSDs * 100 is a good starting
point), and all of your stored objects are pseudo-randomly and evenly
distributed across the PGs. So a PG explicitly does NOT represent a
fixed amount of storage; it represents 1/pg_num'th of the storage you
happen to have on your OSDs.

Ignoring the finer points of CRUSH and custom placement, it goes
something like this in pseudocode::

    locator = object_name
    obj_hash = hash(locator)
    pg = obj_hash % num_pg
    osds_for_pg = crush(pg)  # returns a list of OSDs
    primary = osds_for_pg[0]
    replicas = osds_for_pg[1:]

If you want to understand the crush() part in the above, imagine a
perfectly spherical datacenter in a vacuum ;) that is, if all OSDs
have weight 1.0, there is no topology to the data center (all OSDs
are on the top level), you use the defaults, etc., it simplifies to
consistent hashing; you can think of it as::

    def crush(pg):
        all_osds = ['osd.0', 'osd.1', 'osd.2', ...]
        result = []
        # size is the number of copies: primary + replicas
        r = 0
        while len(result) < size:
            # hash the PG together with the attempt number so that a
            # rejected pick is retried with a different pseudo-random value
            chosen = all_osds[hash((pg, r)) % len(all_osds)]
            r += 1
            if chosen in result:
                # an OSD can be picked only once
                continue
            result.append(chosen)
        return result
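
To make the two snippets above concrete, here is a small,
self-contained Python sketch of the same idea. It is emphatically not
the real CRUSH algorithm: the OSD list, PG count, pool size, and
object names are made up for the example, and a deterministic hash
stands in for Python's per-process-randomized ``hash()``::

    import hashlib

    def stable_hash(s):
        # Python's built-in hash() is randomized per process, so use a
        # deterministic hash for this illustration
        return int(hashlib.md5(str(s).encode()).hexdigest(), 16)

    def toy_crush(pg, all_osds, size):
        # stand-in for crush(): flat cluster, equal weights, no topology
        result = []
        r = 0
        while len(result) < size:
            chosen = all_osds[stable_hash('%d.%d' % (pg, r)) % len(all_osds)]
            r += 1
            if chosen in result:
                continue
            result.append(chosen)
        return result

    all_osds = ['osd.0', 'osd.1', 'osd.2', 'osd.3']   # toy cluster
    num_pg = 8
    size = 2   # primary plus one replica

    for name in ['foo', 'bar', 'baz']:
        pg = stable_hash(name) % num_pg
        osds = toy_crush(pg, all_osds, size)
        print('%s -> pg %d -> primary %s, replicas %s'
              % (name, pg, osds[0], osds[1:]))

Running it shows the properties described above: many objects land in
the same PG, each object lands in exactly one PG, and each PG maps to
an ordered list of OSDs.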


User-visible PG States
======================

.. todo:: diagram of states and how they can overlap

*creating*
  the PG is still being created

*active*
  requests to the PG will be processed

*clean*
  all objects in the PG are replicated the correct number of times

*down*
  a replica with necessary data is down, so the PG is offline

*replay*
  the PG is waiting for clients to replay operations after an OSD crashed

*splitting*
  the PG is being split into multiple PGs (not functional as of 2012-02)

*scrubbing*
  the PG is being checked for inconsistencies

*degraded*
  some objects in the PG are not replicated enough times yet

*inconsistent*
  replicas of the PG are not consistent (e.g. objects are
  the wrong size, objects are missing from one replica *after* recovery
  finished, etc.)

*peering*
  the PG is undergoing the :doc:`/dev/peering` process

*repair*
  the PG is being checked and any inconsistencies found will be repaired (if possible)

*recovering*
  objects are being migrated/synchronized with replicas

*recovery_wait*
  the PG is waiting for the local/remote recovery reservations

*backfilling*
  a special case of recovery, in which the entire contents of
  the PG are scanned and synchronized, instead of inferring what
  needs to be transferred from the PG logs of recent operations

*backfill_wait*
  the PG is waiting in line to start backfill

*backfill_toofull*
  the backfill reservation was rejected because the OSD is too full

*incomplete*
  the PG is missing a necessary period of history from its
  log. If you see this state, report a bug, and try to start any
  failed OSDs that may contain the needed information.

*stale*
  the PG is in an unknown state - the monitors have not received
  an update for it since the PG mapping changed

*remapped*
  the PG is temporarily mapped to a different set of OSDs from what
  CRUSH specified
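
In practice a PG is usually in several of these states at once;
``ceph -s`` and ``ceph pg stat`` report combined strings such as
``active+clean``. If you want a rough tally of states across the whole
cluster, a sketch along the following lines can work. The exact JSON
layout of ``ceph pg dump`` varies between Ceph releases, so the
``pg_stats`` / ``pg_map`` key handling below is an assumption that may
need adjusting for your version::

    import json
    import subprocess
    from collections import Counter

    # dump per-PG status in JSON form
    raw = subprocess.check_output(['ceph', 'pg', 'dump', '--format', 'json'])
    dump = json.loads(raw.decode('utf-8'))

    # assumed layout: a "pg_stats" list whose entries carry "pgid" and a
    # "state" string like "active+clean"; some releases nest the list
    # under a "pg_map" key instead
    pg_stats = dump.get('pg_stats') or dump.get('pg_map', {}).get('pg_stats', [])

    counts = Counter()
    for pg in pg_stats:
        # split the combined state string into its individual states
        for state in pg['state'].split('+'):
            counts[state] += 1

    for state, count in counts.most_common():
        print(state, count)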