Cloudera HDFS CDH5 Installation to use with Mesos

This post is a guide for installing Cloudera HDFS CDH5 on Eucalyptus (Version 3.4.2) in order to use it later with Apache Mesos. The difference of installing HDFS for any kind of use is that we don’t start the Tasktracker on the slave nodes. This is something that Mesos will do each time a hadoop job is running.

The steps you should take are the following:

  • Disable iptables firewall:
    $ ufw disable
  • Disable selinux:
    $ setenforce 0
  • Make sure instances have unique hostnames
  • Make sure the /etc/hosts file on each system has the IP addresses and fully-qualified domain names (FQDN) of all the members of the cluster.
    • hostname –fqdn
    • An example configuration is the following:
      • For the datanode:
        127.0.0.1 localhost
        10.2.24.25 euca-10-2-24-25.eucalyptus.internal
        x.x.x.x euca-10-2-24-25
        
      • For the namenode:
        127.0.0.1 localhost
        10.2.85.213 euca-10-2-85-213.eucalyptus.internal
        y.y.y.y euca-10-2-85-213
        
      • where x.x.x.x and y.y.y.y are the external IPs of your namenode and datanode respectively.
  • Be careful to create the directories that the data  node and name node are using andwill be set later on the configuration files inside the /etc/hadoop/conf.name directory. Also remember to change ownership tohdfs:hdfs
    • on data node:
      $ mkdir -p /mnt/cloudera-hdfs/1/dfs/dn /mnt/cloudera-hdfs/2/dfs/dn /mnt/cloudera-hdfs/3/dfs/dn /mnt/cloudera-hdfs/4/dfs/dn
      $ chown -R hdfs:hdfs /mnt/cloudera-hdfs/1/dfs/dn /mnt/cloudera-hdfs/2/dfs/dn /mnt/cloudera-hdfs/3/dfs/dn /mnt/cloudera-hdfs/4/dfs/dn
      • Typically each of the /1 /2 /3 /4 directories should be different mounted devices. Though using Eucalyptus volumes to do so might add up some latency.
    • on name node:
      $ mkdir -p /mnt/cloudera-hdfs/1/dfs/nn /nfsmount/dfs/nn
      $ chown -R hdfs:hdfs /mnt/cloudera-hdfs/1/dfs/nn /nfsmount/dfs/nn
      $ chmod 700 /mnt/cloudera-hdfs/1/dfs/nn /nfsmount/dfs/nn
  • Deploy configuration to all nodes in the cluster
  • Set alternatives to each node:
    $ update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.mesos-cluster 50
    $ update-alternatives --set hadoop-conf /etc/hadoop/conf.mesos-cluster
    $ update-alternatives --set hadoop-conf /etc/hadoop/conf.mesos-cluster
  • Add configuration to core-site.xml located under /etc/hadoop/
    • Make sure they have the hostnames – not the IP address of the NameNode
  • Install cloudera CDH5: There are multiple ways to do this. Probably the easiest is the following:
    $ wget http://archive.cloudera.com/cdh5/one-click-install/precise/amd64/cdh5-repository_1.0_all.deb
    $ dpkg -i cdh5-repository_1.0_all.deb
    • To master node:
      $ sudo apt-get update;
      $ sudo apt-get update; sudo apt-get install hadoop-hdfs-namenode
    • To slave nodes:
      $ sudo apt-get update; sudo apt-get install hadoop-hdfs-datanode
      $ sudo apt-get update; sudo apt-get install hadoop-client

      cp /etc/hadoop/conf.empty/log4j.properties /etc/hadoop/conf.name/log4j.properties

  • Upgrade/ Format namenode:
    $ service hadoop-hdfs-namenode upgrade
    $ sudo -u hdfs hdfs namenode -format
  • *If the storage directories change hdfs should be reformatted
  • Start HDFS:
    • On master:
      $ service hadoop-hdfs-namenode start
    • On slave:
      $ service hadoop-hdfs-datanode start
  • Optionally: Configure your cluster to start the services after a system restart
    • On master node:
      $ update-rc.d hadoop-hdfs-namenode defaults
      
      • If you have also setted up zookeeper then:
        $ update-rc.d zookeeper-server defaults
    • On the slaves:
      $ update-rc.d hadoop-hdfs-datanode defaults