Automated Big Data Stack Cloud Deployment and Configuration

I am happy to announce that we recently open-sourced, under the BSD license, the tools I’ve developed and used for my research at UCSB to automatically deploy and configure a full big data stack in the cloud in high availability mode. The tools automate the deployment and configuration of the Apache Mesos cluster manager and the Spark, Hadoop, Storm, Kafka, and Hama data processing engines on a Eucalyptus private cloud. High availability mode is also supported: a fully functioning ZooKeeper cluster and the Mesos masters/secondary masters are set up automatically.

These tools have been thoroughly tested with Mesos, Spark, MapReduce, and Storm for the specific versions listed in the README file. They also provide the option to deploy a standalone Spark cluster on Eucalyptus if you don’t need Apache Mesos. The only prerequisite is a running Eucalyptus cloud and root access to your cluster. Everything else is easily configurable in the scripts, and you only need to run a single command with a few arguments and wait until everything is done for you!
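To give a flavor of what an invocation looks like, here is a purely hypothetical example – the script name and flags below are illustrative only and do not reflect the actual CLI of the repository; check the README for the real command and arguments:

    # Hypothetical invocation – script name and flags are made up for illustration;
    # the real entry point and options are documented in the repository's README.
    ./deploy-bigdata-stack.sh --cloud eucalyptus --masters 3 --slaves 10 \
        --frameworks mesos,spark,hadoop,storm --ha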

If you want to use the tools on Amazon EC2 you will need to change the connector (note, though, that if you only care about a Spark/Mesos deployment on EC2, a better starting point might be this GitHub repo instead). Similarly, if you need to use more recent versions you’ll need to modify a couple of lines in the configuration files. I’ll try to support any reasonable requests, but in general you are on your own 🙂

Happy deployments!!!

 


Most Useful Hadoop Commands

This guide is not meant to be comprehensive, and I am not trying to list simple, very frequently used commands like hadoop fs -ls that are very similar to what we use in a Linux shell. If you need a comprehensive guide you can always check the Hadoop commands manual. Instead, I am listing some more “special” commands that I use frequently to manage a Hadoop cluster (HDFS, MapReduce, ZooKeeper, etc.) in the long term and to detect/fix problems. This is a work-in-progress post – I intend to update it with new commands based on how often I use them on my clusters. Feel free to add a comment with commands you find useful for administrative tasks.

HDFS

  1. Set the dfs replication recursively for all existing files
    hadoop dfs -setrep -w 1 -R /
  2. Create a report to check HDFS health
    hadoop dfsadmin -report
  3. Check HDFS file system
    hadoop fsck /
  4. Run the cluster balancer – makes sure blocks are distributed evenly across the datanodes.
    sudo -u hdfs hdfs balancer
  5. Use after removing or adding a datanode
    hadoop dfsadmin -refreshNodes
    sudo -u mapred hadoop mradmin -refreshNodes
  6. Leave safe mode when the Hadoop master (namenode) enters it (often because there is not enough disk space to support your desired replication factor)
     hadoop dfsadmin -safemode leave
  7. Display the datanodes that store a particular file with name “filename”.
    hadoop fsck /file-path/filename -files -locations -blocks

    Sample output:

     FSCK started by root (auth:SIMPLE) from /10.2.31.174 for path /spark-1.2.1-bin-2.3.0-mr1-cdh5.1.2.tgz at Wed May 27 17:31:15 PDT 2015
     /spark-1.2.1-bin-2.3.0-mr1-cdh5.1.2.tgz 186192499 bytes, 2 block(s): OK
     0. BP-323016323-10.2.31.174-1424856295335:blk_1073799439_60141 len=134217728 repl=3 [ip1:50010, ip2:50010, ip3:50010]
     1. BP-323016323-10.2.31.174-1424856295335:blk_1073799440_60142 len=51974771 repl=3 [ip1:50010, ip2:50010, ip3:50010]

     Status: HEALTHY
     Total size: 186192499 B
     Total dirs: 0
     Total files: 1
     Total symlinks: 0
     Total blocks (validated): 2 (avg. block size 93096249 B)
     Minimally replicated blocks: 2 (100.0 %)
     Over-replicated blocks: 0 (0.0 %)
     Under-replicated blocks: 0 (0.0 %)
     Mis-replicated blocks: 0 (0.0 %)
     Default replication factor: 3
     Average block replication: 3.0
     Corrupt blocks: 0
     Missing replicas: 0 (0.0 %)
     Number of data-nodes: 6
     Number of racks: 1
     FSCK ended at Wed May 27 17:31:15 PDT 2015 in 1 milliseconds

Map Reduce

  1. List active map-reduce jobs
    hadoop job -list
  2. Kill a job (use the job ID shown by the -list command)
    hadoop job -kill job-id
  3. Get a jobtracker’s state (active / standby)
    sudo -u mapred hadoop mrhaadmin -getServiceState jt1

    – where jt1 is the name of one of the jobtrackers, as configured in your mapred-site.xml file.
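    For example, assuming the two jobtrackers are named jt1 and jt2 in mapred-site.xml, a quick way to find the active one is to query both (a small sketch – adjust the names to your configuration):

    # Query each configured jobtracker (names jt1/jt2 assumed) and print its HA state
    for jt in jt1 jt2; do
      echo -n "$jt: "
      sudo -u mapred hadoop mrhaadmin -getServiceState "$jt"
    done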

Zookeeper

  1. Initialize the High Availability state on zookeeper
    hdfs zkfc -formatZK
  2. Check the mode of each ZooKeeper server:
    echo srvr | nc localhost 2181 | grep Mode
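    To check the whole ensemble at once, a small loop like the following works (a sketch – replace the hostnames with your own ZooKeeper servers; port 2181 is assumed):

    # Print the mode (leader/follower/standalone) of every ZooKeeper server in the ensemble
    for host in zk1 zk2 zk3; do
      echo -n "$host: "
      echo srvr | nc "$host" 2181 | grep Mode
    done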

Hadoop on Mesos Installation Guide

This post describes how you can set up Apache Hadoop to work with Apache Mesos. In another post I also describe how you can set up Apache Spark to work on Mesos. So by following these instructions you can have Spark and Hadoop running on the same Mesos cluster. My cluster is a Eucalyptus private cloud (Version 3.4.2) with an Ubuntu 12.04.4 LTS image, but the instructions should work for any cluster running Ubuntu or even different Linux distributions after some small changes.

Prerequisites:

  • I assume you have already set up HDFS on your cluster. If not, follow my post here.
  • I also assume you have already installed Mesos. If not, follow the instructions here.

Installation Steps:

  1. Get a hadoop distribution:
    wget http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.3.0-cdh5.1.2.tar.gz

    Careful: if you already have HDFS installed through the apt package manager, make sure you download the same version as the one you are using.

  2. Run “hadoop version” to check the version you are using
  3. If for any reason you have two different versions, the jobtracker running on your master and the TaskTracker running on the slaves will use different versions. This might lead to some funny issues. For example, the tasks will be marked as “finished” by Mesos but will actually be unfinished, and you will see a fatal error in the log files of your executor. The error will look like this:
    FATAL mapred.TaskTracker: Shutting down. Incompatible version or revision.TaskTracker version 
    '2.3.0-cdh5.1.0' and build '2.3.0-cdh5.1.0 from 8e266e052e423af592871e2dfe09d54c03f6a0e     
    by jenkins source checksum 7ec68264497939dee7ab5b91250cbd9' and JobTracker version '2.3.0-cdh5.1.2' 
    and build '2.3.0-cdh5.1.2 from 8e266e052e423af592871e2dfe09d54c03f6a0e8 by jenkins source checksum 
    ec11b8ec19ca2bf3e7cb1bbe4ee182 and hadoop.relaxed.worker.version.check is
     enabled and hadoop.skip.worker.version.check is not enabled.

    To avoid this error, be sure to have hadoop.skip.worker.version.check set to true inside the mapred-site.xml in the Cloudera configuration directory (a sketch of the property is shown below). Still, you might end up having other issues – so I highly recommend using the same version instead.
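    For reference, a minimal sketch of that property as it would appear in mapred-site.xml (surrounding configuration omitted; the file lives in your Cloudera configuration directory, e.g. /etc/hadoop/conf – the exact path depends on your setup):

    <property>
      <name>hadoop.skip.worker.version.check</name>
      <value>true</value>
    </property>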

  4. Untar the file:
    $ tar zxf hadoop-2.3.0-cdh5.1.2.tar.gz
  5. Clone hadooponMesos:
    $ git clone https://github.com/mesos/hadoop.git hadoopOnMesos
  6. Build hadoopOnMesos (run mvn inside the hadoopOnMesos directory you just cloned):
    $ mvn package

    The jar file will be located in the target/ directory.
  7. Copy the jar both to the cloudera installation directories and the hadoop distribution you just downloaded:
     $ cp hadoopOnMesos/target/hadoop-mesos-0.0.8.jar /usr/lib/hadoop-0.20-mapreduce/lib/
    $ cp hadoopOnMesos/target/hadoop-mesos-0.0.8.jar hadoop-2.3.0-cdh5.1.2/share/hadoop/common/lib/

    Modifying Cloudera to run MapReduce v1 instead of YARN: whether you have installed HDFS with the apt package manager or you are planning to use the distribution you just downloaded, you have to do the following to avoid getting this error:

    java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/TaskTracker
    at org.apache.hadoop.mapred.MesosExecutor.launchTask(MesosExecutor.java:75)
  8. Configure CDH5 to run with MRv1 instead of YARN:
    $ cd hadoop-2.3.0-cdh5.1.2
    $ mv bin bin-mapreduce2
    $ mv examples examples-mapreduce2
    $ ln -s bin-mapreduce1 bin
    $ ln -s examples-mapreduce1 examples
    $ pushd etc
    $ mv hadoop hadoop-mapreduce2
    $ ln -s hadoop-mapreduce1 hadoop
    $ popd
    $ pushd share/hadoop
    $ rm mapreduce
    $ ln -s mapreduce1 mapreduce
    $ popd
    
  9. Moreover, if you have installed Cloudera CDH5 with apt-get (see my post here), you should do the following configuration, as also described in this GitHub issue:
    $ cp target/hadoop-mesos-0.0.8.jar /usr/lib/hadoop-0.20-mapreduce/lib
    $ cat > /etc/profile.d/hadoop.sh      # type the two export lines below, then press Ctrl-D
    export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
    export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so

    $ chmod +x /etc/profile.d/hadoop.sh
    $ . /etc/profile.d/hadoop.sh
    
    $ cd ..
    $ rm hadoop-2.3.0-cdh5.1.2-mesos-0.20.tar.gz
    $ tar czf hadoop-2.3.0-cdh5.1.2-mesos-0.20.tar.gz hadoop-2.3.0-cdh5.1.2/
    $ hadoop fs -put hadoop-2.3.0-cdh5.1.2-mesos-0.20.tar.gz /
    

    For this step, depending on your configuration, you might need either to add permissions to your root user (for Cloudera, the hdfs user – not root – is the HDFS superuser), or you can simply copy the .tar.gz file to a directory owned by the hdfs user and hadoop group and then put the file into HDFS by running:

    sudo -u hdfs hadoop fs -put /hdfsuser/hadoop-2.3.0-cdh5.1.2-mesos-0.20.tar.gz /
    
  10. Put the Mesos properties into /etc/hadoop/conf.cluster-name/mapred-site.xml. For a detailed list of properties you can set, take a look here. The following properties in the mapred-site.xml file are necessary:
    
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.MesosScheduler</value>
    </property>
    <property>
      <name>mapred.mesos.taskScheduler</name>
      <value>org.apache.hadoop.mapred.JobQueueTaskScheduler</value>
    </property>
    <property>
      <name>mapred.mesos.master</name>
      <value>zk://euca-x-x-x-x.eucalyptus.internal:2181/mesos</value>
    </property>
    <property>
      <name>mapred.mesos.executor.uri</name>
      <value>hdfs://euca-x-x-x-x.eucalyptus.internal:9000/hadoop-2.3.0-cdh5.1.2-mesos-0.20.tar.gz</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>euca-x-x-x-x.eucalyptus.internal:9001</value>
    </property>
    
  11. Edit the default Hadoop startup script with any editor you want:
    $ vim /usr/lib/hadoop-0.20-mapreduce/bin/hadoop-daemon.sh

    At the beginning of this script, export the location of MESOS_NATIVE_JAVA_LIBRARY:

    export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
    

    If you don’t do this you will get an error that looks like the following when you try to start the jobtracker:

    FATAL org.apache.hadoop.mapred.JobTrackerHADaemon: Error encountered requiring JT shutdown. Shutting down immediately.
    java.lang.UnsatisfiedLinkError: no mesos in java.library.path
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886)
    at java.lang.Runtime.loadLibrary0(Runtime.java:849)
    at java.lang.System.loadLibrary(System.java:1088)
    at org.apache.mesos.MesosNativeLibrary.load(MesosNativeLibrary.java:54)
    at org.apache.mesos.MesosNativeLibrary.load(MesosNativeLibrary.java:79)
    at org.apache.mesos.MesosSchedulerDriver.<init>(MesosSchedulerDriver.java:61)
    at org.apache.hadoop.mapred.MesosScheduler.start(MesosScheduler.java:188) 
    
  12. Start the jobtracker
    $ service hadoop-0.20-mapreduce-jobtracker start

    You should be able to see the jobtracker process by running

    jps
  13. Now you should test the setup:
    $ su mapred
    $ echo "I love UCSB" > /tmp/file0
    $ echo "Do you love UCSB?" > /tmp/file1
    $ hadoop fs -mkdir -p /user/foo/data
    $ hadoop fs -put /tmp/file? /user/foo/data
    $ hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples-2.3.0-mr1-cdh5.1.2.jar wordcount /user/foo/data /user/foo/out
    $ hadoop fs -ls /user/foo/out
    $ hadoop fs -cat /user/foo/out/part*
    
  14. Notice that we didn’t install or start a TaskTracker on the slave machines. Mesos is responsible for starting a TaskTracker when you submit a Hadoop job.
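    If you want to verify this, while the wordcount job from the previous step is running you can log into one of the slave machines and check the running Java processes (a quick sanity check, assuming jps is available on the slave):

    # On a slave node, while a Hadoop job is running on Mesos:
    $ jps | grep TaskTracker    # a TaskTracker launched by the Mesos executor should show up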