src/ceph/doc/cephfs/disaster-recovery.rst

Disaster recovery
=================

.. danger::

    The notes in this section are aimed at experts making a best effort
    to recover what they can from damaged filesystems.  These steps
    have the potential to make things worse as well as better.  If you
    are unsure, do not proceed.


Journal export
--------------

Before attempting dangerous operations, make a copy of the journal like so:

::

    cephfs-journal-tool journal export backup.bin

Note that this command may not always work if the journal is badly corrupted,
in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902).


Dentry recovery from journal
----------------------------

If a journal is damaged or for any reason an MDS is incapable of replaying it,
attempt to recover what file metadata we can like so:

::

    cephfs-journal-tool event recover_dentries summary

By default this command acts on MDS rank 0; pass ``--rank=<n>`` to operate on other ranks.

This command will write any inodes/dentries recoverable from the journal
into the backing store, if these inodes/dentries are higher-versioned
than the previous contents of the backing store.  If any regions of the journal
are missing/damaged, they will be skipped.

Note that in addition to writing out dentries and inodes, this command will update
the InoTables of each 'in' MDS rank, to indicate that any written inodes' numbers
are now in use.  In simple cases, this will result in an entirely valid backing
store state.

.. warning::

    The resulting state of the backing store is not guaranteed to be self-consistent,
    and an online MDS scrub will be required afterwards.  The journal contents
    will not be modified by this command; you should truncate the journal
    separately after recovering what you can.

Journal truncation
------------------

If the journal is corrupt or MDSs cannot replay it for any reason, you can
truncate it like so:

::

    cephfs-journal-tool journal reset

.. warning::

    Resetting the journal *will* lose metadata unless you have extracted
    it by other means such as ``recover_dentries``.  It is likely to leave
    some orphaned objects in the data pool.  It may result in re-allocation
    of already-written inodes, such that permissions rules could be violated.
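
The ordering described in the sections above (export first, then recover
dentries, then reset) can be sketched as a small shell helper.  This is an
illustrative sketch, not part of Ceph: the ``run`` and ``journal_salvage``
functions and the ``EXECUTE`` variable are hypothetical names, and by default
the helper only prints the commands it would run.

```shell
# Sketch only. run, journal_salvage and EXECUTE are hypothetical names.

# Print a command; run it instead when EXECUTE=1.
run() {
    if [ "${EXECUTE:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# journal_salvage <rank> <backup file>
journal_salvage() {
    local rank="$1" backup="$2"
    # 1. Copy the journal before any dangerous operation.
    run cephfs-journal-tool --rank="$rank" journal export "$backup" || return 1
    # 2. Write recoverable inodes/dentries into the backing store.
    run cephfs-journal-tool --rank="$rank" event recover_dentries summary || return 1
    # 3. Truncate the journal only after the two steps above succeeded.
    run cephfs-journal-tool --rank="$rank" journal reset
}
```

For example, ``journal_salvage 0 backup.rank0.bin`` prints the three commands
for rank 0 without running them.  With ``EXECUTE=1``, the helper stops if the
export fails, so that a RADOS-level copy can be made before anything is reset.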

MDS table wipes
---------------

After the journal has been reset, it may no longer be consistent with respect
to the contents of the MDS tables (InoTable, SessionMap, SnapServer).

To reset the SessionMap (erase all sessions), use:

::

    cephfs-table-tool all reset session

This command acts on the tables of all 'in' MDS ranks.  Replace 'all' with an MDS
rank to operate on that rank only.

The session table is the table most likely to need resetting, but if you know you
also need to reset the other tables then replace 'session' with 'snap' or 'inode'.
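
If more than one table needs resetting, the substitution described above is a
loop over the table names.  The following is a minimal sketch under that
assumption; ``run``, ``reset_tables`` and ``EXECUTE`` are hypothetical names
that are not part of Ceph, and the helper prints the commands unless
``EXECUTE=1`` is set.

```shell
# Sketch only. run, reset_tables and EXECUTE are hypothetical names.

# Print a command; run it instead when EXECUTE=1.
run() {
    if [ "${EXECUTE:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# reset_tables <rank|all> <table>...
reset_tables() {
    local rank="$1" table
    shift
    for table in "$@"; do
        case "$table" in
            session|snap|inode)
                run cephfs-table-tool "$rank" reset "$table" || return 1
                ;;
            *)
                # Refuse anything other than the three table names above.
                printf 'unknown table: %s\n' "$table" >&2
                return 1
                ;;
        esac
    done
}
```

For example, ``reset_tables all session`` prints the same command as shown
above, and ``reset_tables 0 session snap inode`` prints one command per table
for rank 0.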

MDS map reset
-------------

Once the in-RADOS state of the filesystem (i.e. contents of the metadata pool)
is somewhat recovered, it may be necessary to update the MDS map to reflect
the contents of the metadata pool.  Use the following command to reset the MDS
map to a single MDS:

::

    ceph fs reset <fs name> --yes-i-really-mean-it

Once this is run, any in-RADOS state for MDS ranks other than 0 will be ignored;
as a result, this command can cause data loss.

One might wonder what the difference is between 'fs reset' and 'fs remove; fs new'.  The
key distinction is that doing a remove/new will leave rank 0 in 'creating' state, such
that it would overwrite any existing root inode on disk and orphan any existing files.  In
contrast, the 'reset' command will leave rank 0 in 'active' state such that the next MDS
daemon to claim the rank will go ahead and use the existing in-RADOS metadata.

Recovery from missing metadata objects
--------------------------------------

Depending on what objects are missing or corrupt, you may need to
run various commands to regenerate default versions of the
objects.

::

    # Session table
    cephfs-table-tool 0 reset session
    # SnapServer
    cephfs-table-tool 0 reset snap
    # InoTable
    cephfs-table-tool 0 reset inode
    # Journal
    cephfs-journal-tool --rank=0 journal reset
    # Root inodes ("/" and MDS directory)
    cephfs-data-scan init

Finally, you can regenerate metadata objects for missing files
and directories based on the contents of a data pool.  This is
a three-phase process.  First, scan *all* objects to calculate
size and mtime metadata for inodes.  Second, scan the first
object from every file to collect this metadata and inject it into
the metadata pool.  Third, check inode linkages and fix any errors
found.

::

    cephfs-data-scan scan_extents <data pool>
    cephfs-data-scan scan_inodes <data pool>
    cephfs-data-scan scan_links

The ``scan_extents`` and ``scan_inodes`` commands may take a *very long* time
if there are many files or very large files in the data pool.

To accelerate the process, run multiple instances of the tool.

Decide on a number of workers, and pass each worker a number within
the range 0-(worker_m - 1).

The example below shows how to run 4 workers simultaneously:

::

    # Worker 0
    cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool>

    # Worker 0
    cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data pool>

It is **important** to ensure that all workers have completed the
scan_extents phase before any workers enter the scan_inodes phase.
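
One way to enforce that ordering from a single shell is to start the workers
of a phase in the background and ``wait`` for all of them before starting the
next phase.  The following is a sketch of that idea, not a tool shipped with
Ceph: ``run_phase``, ``parallel_scan`` and ``EXECUTE`` are hypothetical names,
and ``cephfs_data`` in the usage note stands in for your data pool.  Unless
``EXECUTE=1`` is set, the commands are only printed.

```shell
# Sketch only. run_phase, parallel_scan and EXECUTE are hypothetical names.

# run_phase <phase> <worker count> <data pool>
# With EXECUTE=1, starts one cephfs-data-scan worker per index in the
# background and waits for all of them; otherwise prints the commands.
run_phase() {
    local phase="$1" workers="$2" pool="$3"
    local n=0 pids="" pid failed=0
    while [ "$n" -lt "$workers" ]; do
        if [ "${EXECUTE:-0}" = "1" ]; then
            cephfs-data-scan "$phase" --worker_n "$n" --worker_m "$workers" "$pool" &
            pids="$pids $!"
        else
            printf 'cephfs-data-scan %s --worker_n %s --worker_m %s %s\n' \
                "$phase" "$n" "$workers" "$pool"
        fi
        n=$((n + 1))
    done
    # Barrier: do not return until every worker of this phase has exited.
    for pid in $pids; do
        wait "$pid" || failed=1
    done
    return "$failed"
}

# parallel_scan <worker count> <data pool>
parallel_scan() {
    # scan_inodes must not start before all scan_extents workers are done.
    run_phase scan_extents "$1" "$2" || return 1
    run_phase scan_inodes "$1" "$2"
}
```

For example, ``parallel_scan 4 cephfs_data`` prints the eight commands from the
example above with the pool name filled in.  With ``EXECUTE=1``, the
``scan_inodes`` workers are started only if every ``scan_extents`` worker
exited successfully.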

After completing the metadata recovery, you may want to run a cleanup
operation to delete ancillary data generated during recovery.

::

    cephfs-data-scan cleanup <data pool>

Finding files affected by lost data PGs
---------------------------------------

Losing a data PG may affect many files.  Files are split into many objects,
so identifying which files are affected by loss of particular PGs requires
a full scan over all object IDs that may exist within the size of a file.
This type of scan may be useful for identifying which files require
restoring from a backup.

.. danger::

    This command does not repair any metadata, so when restoring files in
    this case you must *remove* the damaged file, and replace it in order
    to have a fresh inode.  Do not overwrite damaged files in place.

If you know that objects have been lost from PGs, use the ``pg_files``
subcommand to scan for files that may have been damaged as a result:

::

    cephfs-data-scan pg_files <path> <pg id> [<pg id>...]

For example, if you have lost data from PGs 1.4 and 4.5, and you would like
to know which files under /home/bob might have been damaged:

::

    cephfs-data-scan pg_files /home/bob 1.4 4.5

The output will be a list of paths to potentially damaged files, one
per line.

Note that this command acts as a normal CephFS client to find all the
files in the filesystem and read their layouts, so the MDS must be
up and running.
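
The remove-then-replace rule from the danger box above can be sketched as a
loop over a saved ``pg_files`` output.  This is an illustration under several
assumptions, none of which come from Ceph itself: the list was saved to a file,
the filesystem is mounted at a known mount root, a backup tree mirrors the same
paths under a backup root, and ``run``, ``restore_fresh`` and ``EXECUTE`` are
hypothetical names.  The helper prints the commands unless ``EXECUTE=1`` is set.

```shell
# Sketch only. run, restore_fresh and EXECUTE are hypothetical names.

# Print a command; run it instead when EXECUTE=1.
run() {
    if [ "${EXECUTE:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# restore_fresh <path list> <mount root> <backup root>
restore_fresh() {
    local list="$1" mount_root="$2" backup_root="$3" path
    while IFS= read -r path; do
        # Skip empty lines in the saved list.
        [ -n "$path" ] || continue
        # Remove first so that the restored copy gets a fresh inode.
        run rm -f -- "$mount_root$path" || return 1
        # Then copy the file back from the backup tree.
        run cp -p -- "$backup_root$path" "$mount_root$path" || return 1
    done < "$list"
}
```

For example, ``restore_fresh damaged.txt /mnt/cephfs /mnt/backup`` prints one
``rm`` and one ``cp`` command per listed path.  The printed form is a preview
only and does not quote paths that contain spaces.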

Using an alternate metadata pool for recovery
---------------------------------------------

.. warning::

   There has not been extensive testing of this procedure. It should be
   undertaken with great care.

If an existing filesystem is damaged and inoperative, it is possible to create
a fresh metadata pool and attempt to reconstruct the filesystem metadata
into this new pool, leaving the old metadata in place. This could be used to
make a safer attempt at recovery since the existing metadata pool would not be
overwritten.

.. caution::

   During this process, multiple metadata pools will contain data referring to
   the same data pool. Extreme caution must be exercised to avoid changing the
   data pool contents while this is the case. Once recovery is complete, the
   damaged metadata pool should be deleted.

To begin this process, first create the fresh metadata pool and initialize
it with empty file system data structures:

::

    ceph fs flag set enable_multiple true --yes-i-really-mean-it
    ceph osd pool create recovery <pg-num> replicated <crush-ruleset-name>
    ceph fs new recovery-fs recovery <data pool> --allow-dangerous-metadata-overlay
    cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery
    ceph fs reset recovery-fs --yes-i-really-mean-it
    cephfs-table-tool recovery-fs:all reset session
    cephfs-table-tool recovery-fs:all reset snap
    cephfs-table-tool recovery-fs:all reset inode

Next, run the recovery toolset using the --alternate-pool argument to output
results to the alternate pool:

::

    cephfs-data-scan scan_extents --alternate-pool recovery --filesystem <original filesystem name> <original data pool name>
    cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem <original filesystem name> --force-corrupt --force-init <original data pool name>
    cephfs-data-scan scan_links --filesystem recovery-fs

If the damaged filesystem contains dirty journal data, it may be recovered next
with:

::

    cephfs-journal-tool --rank=<original filesystem name>:0 event recover_dentries list --alternate-pool recovery
    cephfs-journal-tool --rank recovery-fs:0 journal reset --force

After recovery, some recovered directories will have incorrect statistics.
Ensure the parameters mds_verify_scatter and mds_debug_scatterstat are set
to false (the default) to prevent the MDS from checking the statistics, then
run a forward scrub to repair them. Ensure you have an MDS running and issue:

::

    ceph daemon mds.a scrub_path / recursive repair