I am happy to announce that we recently open-sourced under the BSD license, the tools I’ve developed and used for my research at UCSB, to automatically deploy and configure in high availability mode a full big data stack in the cloud. The tools automate the deployment and configuration of Apache Mesos cluster manager and Spark, Hadoop, Storm, Kafka, and Hama data processing engines in Eucalyptus private cloud. High Availability mode is also supported (A full functioning Zookeeper cluster, and Mesos masters/ secondary masters are setup automatically).
These tools have been severely tested with Mesos, Spark, Map-Reduce, and Storm for the specific versions specified on the readme file. They also provide the option to deploy a Spark standalone cluster on Eucalyptus if you don’t need Apache Mesos. The only prerequisite is that you have a running Eucalyptus cloud and root access to your cluster. Everything else is very easily configurable on the scripts and you only need to run a simple command with arguments and wait until everything is done for you!
If you want to use on Amazon EC2 you will need to change the connector (Notice though that if you only care about Spark/ Mesos deployment on EC2 a better starting point might be this github repo instead). Similarly, if you need to use with more recent versions you’ll need to modify a couple of lines on the configuration files. I’ll try to support any reasonable requests but in general you are on your own 🙂
This post describes how you can set up Apache Spark to work with Apache Mesos. In another post I also describe how you can set up Apache Hadoop to work on Mesos. So by following these instructions you can have both Spark and Hadoop running on the same Mesos cluster.
*The instructions bellow have been tested with mesos 0.20 and Spark 1.1.0 ((Update: 3/16/2016: Have also tested with mesos 0.27.2 and Spark 1.6.1 on Ubuntu Trusty 14.04 – Most steps work as bellow just by changing the version numbers. I explicitly note when there is some difference on compiling steps between the two versions)
- I assume you have already set up HDFS to your cluster. If not, follow my post here
- I also assume you have already installed Mesos. If not follow the instructions here
To run Spark with Mesos you should do the following:
- Download the Spark tar file
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0.tgz
$ tar zxfv spark-1.1.0.tgz
$ cd spark-1.1.0
- Build spark
- (METHOD A)
$ sbt/sbt assembly
- (METHOD B)
$ ./make-distribution.sh --tgz
Careful! Build as necessary if you use a different HDFS version than the default (1.0.4) – For example for the Cloudera CDH5.1.2 that I described how to install in another post use:
$ ./make-distribution.sh --tgz -Dhadoop.version=2.3.0-cdh5.1.2 -Dprotobuf.version=2.5.0 -DskipTests
- Compiling with -Dhadoop.version=2.3.0-mr1-cdh5.1.2 won’t work anymore as the codehaus repo is no longer active and the files hosted now on Apache central are renamed!
- In the above example I am specifying the protobuf version to use to avoid the following error that will be thrown when trying to run Spark (This is no longer needed on the newer Spark 1.6+ version):
Exception in thread "main" java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$SetOwnerRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
- You can find a detailed list of how to build with different HDFS versions here
- Careful for the Spark 1.1 version! I recommend you compile by setting your JAVA_HOME to JAVA 6. Later its ok to set it back to JAVA 7. If you compile with JAVA 7 and then run executors with JAVA 6 you will get an error on the executors. Moreover, I’ve seen this error even when having JAVA 7 on the executors. The worst part is that with a newer version of SPARK (1.2.1) there is not even an error thrown on the executors. You just see executors getting lost…
- Tar the version you’ve built and put into HDFS so it can be shipped to the executors
$ tar czfv spark-1.1.0.tgz spark-1.1.0
$ sudo -u hdfs hadoop fs -put /hdfsuser/spark-x.x.x.tgz /
Careful! If you deploy the wrong .tgz file (such as the one you just downloaded instead of the one produced after running make-distribution) then the tasks will fail. To debug you should check your executor log files. In this case the error will look like this:
ls: cannot access /tmp/mesos/slaves/20140913-135403-421003786-5050-24966-0/frameworks/20140914-153830-421003786-5050-27925-0001/executors/20140913-135403-421003786-5050-24966-0/runs/b17fb191-4db1-4aa7-8363-3ff0f12d88a3/spark-1.1.0/assembly/target/scala-2.10: No such file or directory
- Modify spark-1.1.0/conf/spark-env.sh by adding the following lines of configuration code:
- To run your applications with spark-submit without having to modify each time yourcode edit the configuration file: spark-1.1.0/conf/spark-defaults.conf by adding the following line:
# 10.2.24.25 is the internal Eucalyptus IP of Master
# euca-10-2-24-25 is the hostname – put there whatever you get by running the command `hostname`