========================
Using Hadoop with CephFS
========================

The Ceph file system (CephFS) can be used as a drop-in replacement for the
Hadoop Distributed File System (HDFS). This page describes how to install and
configure CephFS for use with Hadoop.

Dependencies
============

* CephFS Java Interface
* Hadoop CephFS Plugin

.. important:: Currently requires Hadoop 1.1.X stable series

Installation
============

There are three requirements for using CephFS with Hadoop. First, a running
Ceph installation is required. The details of setting up a Ceph cluster and
the file system are beyond the scope of this document. Please refer to the
Ceph documentation for installing Ceph.

The remaining two requirements are a Hadoop installation and the Ceph file
system Java packages, including the Java CephFS Hadoop plugin. The high-level
steps are to add the dependencies to the Hadoop installation's ``CLASSPATH``
and to configure Hadoop to use the Ceph file system.

CephFS Java Packages
--------------------

* CephFS Hadoop plugin (`hadoop-cephfs.jar <http://ceph.com/download/hadoop-cephfs.jar>`_)

How you add these dependencies to a Hadoop installation depends on your
particular deployment. In general, the dependencies must be present on each
node that will be part of the Hadoop cluster, and must be in the
``CLASSPATH`` searched by Hadoop. Typical approaches are to place the
additional ``jar`` files into the ``hadoop/lib`` directory, or to edit the
``HADOOP_CLASSPATH`` variable in ``hadoop-env.sh``.

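As a sketch of the second approach, a line such as the following could be
added to ``hadoop-env.sh`` on each node. The ``/usr/share/java`` location is
an assumption made for illustration and should be replaced with the directory
where the jar is actually installed::

    # Hypothetical install path; adjust to match your deployment.
    CEPHFS_JAR=/usr/share/java/hadoop-cephfs.jar
    # Prepend the plugin jar, keeping any existing HADOOP_CLASSPATH entries.
    export HADOOP_CLASSPATH="${CEPHFS_JAR}${HADOOP_CLASSPATH:+:${HADOOP_CLASSPATH}}"
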
The native Ceph file system client must be installed on each participating
node in the Hadoop cluster.

Hadoop Configuration
====================

This section describes the Hadoop configuration options used to control Ceph.
These options are intended to be set in the Hadoop configuration file
``conf/core-site.xml``.

+---------------------+--------------------------+----------------------------+
|Property             |Value                     |Notes                       |
+=====================+==========================+============================+
|fs.default.name      |Ceph URI                  |ceph://[monaddr:port]/      |
+---------------------+--------------------------+----------------------------+
|ceph.conf.file       |Local path to ceph.conf   |/etc/ceph/ceph.conf         |
+---------------------+--------------------------+----------------------------+
|ceph.conf.options    |Comma-separated list of   |opt1=val1,opt2=val2         |
|                     |Ceph configuration        |                            |
|                     |key/value pairs           |                            |
+---------------------+--------------------------+----------------------------+
|ceph.root.dir        |Mount root directory      |Default value: /            |
+---------------------+--------------------------+----------------------------+
|ceph.mon.address     |Monitor address           |host:port                   |
+---------------------+--------------------------+----------------------------+
|ceph.auth.id         |Ceph user id              |Example: admin              |
+---------------------+--------------------------+----------------------------+
|ceph.auth.keyfile    |Ceph key file             |                            |
+---------------------+--------------------------+----------------------------+
|ceph.auth.keyring    |Ceph keyring file         |                            |
+---------------------+--------------------------+----------------------------+
|ceph.object.size     |Default file object size  |Default value (64 MB):      |
|                     |in bytes                  |67108864                    |
+---------------------+--------------------------+----------------------------+
|ceph.data.pools      |List of Ceph data pools   |Default value: default Ceph |
|                     |for storing files         |pool                        |
+---------------------+--------------------------+----------------------------+
|ceph.localize.reads  |Allow reading from file   |Default value: true         |
|                     |replica objects           |                            |
+---------------------+--------------------------+----------------------------+

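For orientation, a minimal ``conf/core-site.xml`` using a few of these
options might look like the following sketch. The monitor host name
``mon1.example.com``, the port ``6789``, and the keyring path are placeholder
assumptions rather than required values; replace them with the details of
your own cluster. ::

        <configuration>
          <!-- Placeholder monitor address; use your own monitor host and port. -->
          <property>
            <name>fs.default.name</name>
            <value>ceph://mon1.example.com:6789/</value>
          </property>
          <property>
            <name>ceph.conf.file</name>
            <value>/etc/ceph/ceph.conf</value>
          </property>
          <property>
            <name>ceph.auth.id</name>
            <value>admin</value>
          </property>
          <!-- Assumed keyring location for the admin user; adjust as needed. -->
          <property>
            <name>ceph.auth.keyring</name>
            <value>/etc/ceph/ceph.client.admin.keyring</value>
          </property>
        </configuration>
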
Support For Per-file Custom Replication
---------------------------------------

The Hadoop file system interface allows users to specify a custom replication
factor (e.g. 3 copies of each block) when creating a file. However, object
replication factors in the Ceph file system are controlled on a per-pool
basis, and by default a Ceph file system will contain only a single
pre-configured pool. Thus, in order to support per-file replication with
Hadoop over Ceph, additional storage pools with non-default replication
factors must be created, and Hadoop must be configured to choose from these
additional pools.

Additional data pools can be specified using the ``ceph.data.pools``
configuration option. The value of the option is a comma-separated list of
pool names. The default Ceph pool will be used automatically if this
configuration option is omitted or the value is empty. For example, the
following configuration setting will consider the pools ``pool1``, ``pool2``, and
``pool5`` when selecting a target pool to store a file::

        <property>
          <name>ceph.data.pools</name>
          <value>pool1,pool2,pool5</value>
        </property>

Hadoop will not create pools automatically. In order to create a new pool with
a specific replication factor use the ``ceph osd pool create`` command, and then
set the ``size`` property on the pool using the ``ceph osd pool set`` command. For
more information on creating and configuring pools see the `RADOS Pool
documentation`_.

.. _RADOS Pool documentation: ../../rados/operations/pools

Once a pool has been created and configured, the metadata service must be told
that the new pool may be used to store file data. A pool is made available
for storing file system data using the ``ceph fs add_data_pool`` command.

First, create the pool. In this example we create the ``hadoop1`` pool with
100 placement groups and replication factor 1::

    ceph osd pool create hadoop1 100
    ceph osd pool set hadoop1 size 1

Next, determine the pool id. This can be done by examining the output of the
``ceph osd dump`` command. For example, we can look for the newly created
``hadoop1`` pool::

    ceph osd dump | grep hadoop1

The output should resemble::

    pool 3 'hadoop1' rep size 1 min_size 1 crush_ruleset 0...

where ``3`` is the pool id. Next, use the pool id to register the pool as a
data pool for storing file system data. In this command ``cephfs`` is the name
of the file system::

    ceph fs add_data_pool cephfs 3

The final step is to configure Hadoop to consider this data pool when
selecting the target pool for new files::

        <property>
          <name>ceph.data.pools</name>
          <value>hadoop1</value>
        </property>

Pool Selection Rules
~~~~~~~~~~~~~~~~~~~~

The following rules describe how Hadoop chooses a pool given a desired
replication factor and the set of pools specified using the
``ceph.data.pools`` configuration option.

1. When no custom pools are specified, the default Ceph data pool is used.
2. A custom pool with the same replication factor as the default Ceph data
   pool will override the default.
3. A pool with a replication factor that matches the desired replication will
   be chosen if it exists.
4. Otherwise, a pool with at least the desired replication factor will be
   chosen or, failing that, the pool with the largest available replication
   factor.

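As an illustration of rules 3 and 4, consider the following hypothetical
setup. The pool ``hadoop3`` and all of the replication factors here are
assumptions chosen for the example: ``hadoop1`` has ``size`` 1, ``hadoop3`` has
``size`` 3, and the default Ceph data pool also has ``size`` 3. ::

        <!-- Hypothetical pools: hadoop1 (size 1) and hadoop3 (size 3) -->
        <property>
          <name>ceph.data.pools</name>
          <value>hadoop1,hadoop3</value>
        </property>

Reading the rules literally, one would then expect:

* a file requested with replication 1 to be stored in ``hadoop1`` (exact
  match, rule 3);
* a file requested with replication 3 to be stored in ``hadoop3`` (exact
  match, rule 3; ``hadoop3`` also overrides the default pool under rule 2);
* a file requested with replication 2 to be stored in ``hadoop3`` (no exact
  match, so rule 4 picks a pool with at least the desired replication).
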
Debugging Pool Selection
~~~~~~~~~~~~~~~~~~~~~~~~

Hadoop will produce a log file entry when it cannot determine the replication
factor of a pool (e.g. it is not configured as a data pool). The log message
will appear as follows::

    Error looking up replication of pool: <pool name>

Hadoop will also produce a log entry when it cannot select an exact
match for replication. This log entry will appear as follows::

    selectDataPool path=<path> pool:repl=<name>:<value> wanted=<value>
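
When investigating either message, it can help to confirm the pool's
replication factor and whether the pool is registered with the file system.
A minimal sketch, reusing the ``hadoop1`` pool from the earlier example::

    # Show the replication factor (size) configured on the pool.
    ceph osd pool get hadoop1 size
    # List file systems together with their metadata and data pools.
    ceph fs ls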