Deploying a Spark job using Juju

Guest

on 13 September 2016

Tags: Apache , Juju , spark

This article was last updated 8 years ago.


Juju makes it easy to setup and monitor a Spark cluster with a few commands. In this guide we will setup a new cluster and deploy a Spark job using this tool. According to the official definition, Juju is a service modelling tool that allows people to model, configure and deploy applications in the cloud. It also offers some monitoring capabilities as well as scaling.

The application we will deploy detects fraud in financial transactions as explained in this blog post.

We will be using Ubuntu 14.04.4 LTS on this guide. If you’re on another OS you can use VirtualBox to run Linux by following these instructions.

As a first step we need to install Juju (version 1.25 – latest stable):

Install Juju

sudo add-apt-repository ppa:juju/stable
sudo apt-get -y update
sudo apt-get -y install juju juju-local git make sshfs
sudo pip install bundletester
sudo apt-get -y install charm-tools

For Ubuntu versions <= 12.04 you may run into issues when trying to install juju-local and need to install LXC userspace tools and the MongoDB server as mentioned here.

Generate the config file

Each juju environment needs to know how to connect to your cloud / private infrastructure in which to deploy. By default Juju ships with AWS (default), HP Cloud, OpenStack, Azure, MaaS, Local and Manual providers. This information is stored in the environments.yaml file. To generate the template file run:

juju generate-config

Which generates the config file under ~/.juju/environments.yaml.

This will fill it out with template info. Edit it to reflect the cloud provider you want to use and the credentials juju will have to use.

We will use the local environment for demo purposes and will be deploying to lxc containers.

Bootstraping

In order to create the cloud environment that Juju will use to deploy to, we need to run the bootstrap command:

juju bootstrap

This will also install the public SSH key on the machine.

Running juju status will show information about the environment, the services deployed and the units. After running the bootstrap command there are no nodes or services deployed yet which should show something like this:


juju status –format=tabular

Keep the monitoring in the background using a watch:

watch juju status --format=tabular

Installing Spark

You can deploy Spark in either Standalone or High Availability (HA) mode.

juju deploy apache-spark

This will first provision a VM and install the Spark charm on it. A Charm is a software component that contains all the instructions necessary to deploy and configure cloud-based services.

The above operation may take a few minutes. A more detailed view of what is going on int he background can be retrieved using:

juju debug-log

Once the operation has finished we shall see a master node in state Ready as shown below.

We can add new nodes to the spark cluster by running a simple juju add-unit command.

However, to enable high availability on a cluster of machines it is essential we install Zookeeper, the distributed resource coordination service that Spark uses.

To do so, we need to run

juju deploy apache-zookeeper

…which will spawn a new node (or an lxc-container in case we run it locally), download zookeeper and run it. It will be ready to use when the status is shown as Ready at the juju status command. Although a ZooKeeper node is up and running, juju status will show as Ready (1 zk units: less than 3 is suboptimal). This is because Zookeeper requires at least 3 or more (odd) number of nodes to be fully functional. Given that ZK functionality is based on agreement between distributed nodes, the majority of votes – and therefore the truth – can be achieved with 3 or more nodes. This is because if one machine fails in a cluster of 2 machines, the remaining machine does not constitute a majority. With 3 nodes, we can tolerate 1 node failure and still operate with the majority of two nodes.

In order to fix this, let’s add 2 more ZK nodes by running:

juju add-unit -n 2 apache-zookeeper

and wait until they are in state ‘Ready’ again.

Now that ZK is up and running we can add new nodes to our Spark cluster by running:

juju add-unit -n 2 apache-spark

Notice that we have not connected Spark with Zookeeper yet. To do so, we need to let juju know about the relation:

juju add-relation apache-zookeeper apache-spark

You can see that the spark cluster is now in HA (High Availability) mode. It is not clear who the master node is but we can retrieve this information by logging into one of the ZK nodes and asking for it.

juju ssh apache-zookeeper/0
zkCli.sh
get /spark/master_status

So the spark master is 10.0.3.159 ! In order to verify our cluster works correctly we need to submit a test job. The easiest way to do this is to ssh to one of the spark nodes and run:

spark-submit --class org.apache.spark.examples.SparkPi --master spark://10.0.3.159:7077 /usr/lib/spark/lib/spark-examples*.jar 10

By looking at the /var/log/spark/spark.log on each of the three nodes we can see that the job has been submitted successfully and is running. After a few seconds Pi should be ……… 3.14.

Running Anomaly Detection on cluster

First let’s make the training data available on each machine by running:

for i in $(seq 0 1 2); do
   juju ssh apache-spark/$i "
   mkdir -p /home/ubuntu/src/test/resources
   cd /home/ubuntu/src/test/resources
   wget https://raw.githubusercontent.com/mvogiatzis/spark-anomaly-detection/master/src/test/resources/training.csv
   wget https://raw.githubusercontent.com/mvogiatzis/spark-anomaly-detection/master/src/test/resources/cross_val.csv
   "
done

Also make the jar available by uploading it on the juju machine and scping it to all three nodes:

juju scp anomaly-detection_2.10-1.0.jar apache-spark/

Time to submit our application. Logging into one of the nodes that can talk to the master node, a Spark application can be submitted to the cluster as usual:

spark-submit --class MainRun --master spark://10.0.3.159:7077 --total-executor-cores 8 target/scala-2.10/anomaly-detection_2.10-1.0.jar

Looking at the logs /var/log/spark/, we can see the same results/outliers that were computed at our standalone server:

Conclusions

We saw how Juju makes the Spark setup and deployment straightforward and easy to reason about. It took a few steps to setup a Spark cluster and deploy the anomaly detection application. Details about how the fraud-detection app works can be found here.

I’d like to thank the Canonical big data team for letting me run my experiments using their powerful resources. They are doing a great job and are now also working on integrating juju charms within the Apache BigTop project which provides management, maintenance, testing and virtualization for various big data projects such as Hadoop and Spark.

Original article

About the author

Michael Vogiatzis is a Software Engineer based in London. His mission is to help companies solve their complex problems in the simplest way possible in various business areas. He cares about clean, robust code and use agile techniques to deliver tested and high quality solutions. He recently led the back-end development of a highly efficient, distributed algorithm for taxi-sharing and public transport commuting in London. Michael currently works for Net-a-Porter group.

Ubuntu cloud

Ubuntu offers all the training, software infrastructure, tools, services and support you need for your public and private clouds.

Newsletter signup

Get the latest Ubuntu news and updates in your inbox.

By submitting this form, I confirm that I have read and agree to Canonical's Privacy Policy.

Related posts

How we used Flask and 12-factor charms to simplify Canonical.com development

Learn how Canonical is using Python Flask and the 12-factor charm framework to simplify the development of Canonical.com and Ubuntu.com

Spark or Hadoop: the best choice for big data teams?

I always find the Olympics to be an unusual experience. I’m hardly an athletics fanatic, yet I can’t help but get swept up in the spirit of the competition....

Apache Spark 4.0 beta release – try it now

Apache Spark is a popular framework for developing distributed, parallel data processing applications. Our solution for Apache Spark on Kubernetes has made...