Vagrant Tutorial - Spark in a VM
Configuring my computer's environment for different tools without risking my work computer sounds like a dream. Bonus points for using those configurations to practice firing up a cluster of virtual machines. In comes Vagrant.
Vagrant wraps around a virtual machine provider (VirtualBox in vanilla Vagrant, though it can manage other providers as well), so the entire setup of a virtual machine can be done through a script rather than by hand. This lets you version control your machine configuration and gives precisely repeatable environments to you and anyone else you share your box with.
Let's start from scratch and build up to a virtual machine running Spark that can be created using just a couple of keystrokes. I'm using Ubuntu 14.04, though everything after the installation step should just work on Mac as well. Every code block starting with a $ denotes a command to be run in your terminal.
Initial Setup
Vagrant is a script-flavored wrapper for virtual machines. The out-of-the-box provider is VirtualBox, so it needs to be installed before Vagrant.
$ sudo apt-get install virtualbox
$ sudo apt-get install vagrant
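Before moving on, it can help to confirm that both tools actually landed on your PATH. The snippet below is a small sketch of that check. have_tool is a made-up helper name, not part of Vagrant, and VBoxManage is the command-line tool that ships with VirtualBox.

```shell
# have_tool is a hypothetical helper (not part of Vagrant or VirtualBox).
# It reports whether a command can be found on the PATH.
have_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# Check both tools; "|| true" keeps the loop going so both results print.
for tool in VBoxManage vagrant; do
    have_tool "$tool" || true
done
```

If either line says missing, re-run the matching install command before continuing.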
Let's download a pre-made box to use as our starting point. There are plenty available online. We'll be using the vanilla Ubuntu Server 12.04 LTS box. Add the box to Vagrant through the vagrant box add command.
$ vagrant box add precise32 http://files.vagrantup.com/precise32.box
Use vagrant box list to print the list of available boxes. Verify you now have at least precise32 (virtualbox) in that list.
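If you would rather script that check than eyeball it, something like the following sketch works. box_listed is a hypothetical helper that reads the output of vagrant box list on standard input. It assumes each line starts with the box name followed by whitespace, as in the precise32 (virtualbox) line above.

```shell
# box_listed is a hypothetical helper: it reads `vagrant box list` output
# on stdin and succeeds if a box with the given name appears in it.
# Assumes each line starts with the box name followed by whitespace.
box_listed() {
    grep -q "^$1[[:space:]]"
}

# Usage (requires Vagrant, so it is left commented out here):
# vagrant box list | box_listed precise32 && echo "precise32 is available"
```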
It's now time to initialize a Vagrant project. Start a new directory to play in, cd into that directory, and run vagrant init.
$ vagrant init precise32
Your project directory is now initialized with the file Vagrantfile, which tells Vagrant what to consider the root directory for your project and is the version-controllable script that you will use to share your to-be box. We'll come back to this file later to customize our box but, for now, tell Vagrant to fire up the vm.
$ vagrant up
Congratz! You've just fired up your first vagrant-based virtual machine. Now what?
Manually Install Spark
My main motivation for working with Vagrant is to make a playground for setting up and running different big data tools without breaking my primary environment (many boxes were hurt in the writing of this post). The next tool I want to test drive is Spark, so Spark is what I'll be installing into this vm.
Let's log into your new vm. Still in your project's root directory, run
$ vagrant ssh
You'll now be logged in as user vagrant on your precise32 box.
We'll be using a pre-built version of Spark, so getting it running is fairly straightforward. First we'll need to install a JDK.
$ sudo apt-get -y update
$ sudo apt-get -y install openjdk-7-jdk
Next, we need to download the pre-built Spark and extract the archive.
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-hadoop1.tgz
$ tar -zxf spark-1.1.0-bin-hadoop1.tgz
Test the Installation
Let's move into the newly unzipped spark directory.
$ cd spark-1.1.0-bin-hadoop1
Spark is pre-built and should just run. Let's run the trivial example and make sure it's working. Fire up the Spark-based Scala shell.
$ ./bin/spark-shell
The following two lines entered into the newly-launched spark-shell will open Spark's README.md file and count its number of lines.
scala> val textFile = sc.textFile("README.md")
scala> textFile.count()
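For an independent number to compare against, you can count the lines of README.md from a regular terminal in the same directory, outside of spark-shell. This is only a sketch. count_lines is a hypothetical helper, and its result should agree with textFile.count() as long as the file ends with a newline.

```shell
# count_lines is a hypothetical helper: print the number of lines in a file.
# wc -l counts newline characters, so the result matches a line-by-line
# count only when the file ends with a trailing newline.
count_lines() {
    wc -l < "$1"
}

# Usage, from inside the spark-1.1.0-bin-hadoop1 directory:
# count_lines README.md
```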
You will have seen lots of output print while opening the shell and after running each command. Make sure the last line you see printed is
res0: Long = 126
If so, it worked! Nice job. I'll dive more into Spark in a future post, but for now let's get back to Vagrant. Exit the spark-shell, exit out of your vm, and make sure you're back in the root directory of your Vagrant project.
Script-izing Our Installation
We're using Vagrant to let us script-ify and version control our virtual environment creation, so let's take everything we did manually up top and put it into Vagrantfile. Open your project's Vagrantfile in your favorite text editor and we'll automate everything we did by hand earlier.
Your Vagrantfile will already have the line
config.vm.box = "precise32"
telling Vagrant to use the precise32 box. Before, we had to make sure we had a copy of that box locally. Alternatively, we can give Vagrant a URL where it can find the box and download it if it's not available locally. We do that by adding the following line below the config.vm.box = "precise32" line.
config.vm.box_url = "http://files.vagrantup.com/precise32.box"
Next, we need to tell Vagrant to install the JDK and download and unpack Spark. We did this through four commands above. The way to tell Vagrant to run a shell command takes the form
config.vm.provision :shell, inline: "<shell command>"
So, we'll want to add the following four lines next:
config.vm.provision :shell, inline: "sudo apt-get -y update"
config.vm.provision :shell, inline: "sudo apt-get -y install openjdk-7-jdk"
config.vm.provision :shell, inline: "wget http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-hadoop1.tgz"
config.vm.provision :shell, inline: "tar -zxf spark-1.1.0-bin-hadoop1.tgz"
Your Vagrantfile now holds the box name, the box URL, and the four provisioning lines. Let's test it.
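One thing to be aware of: the wget and tar steps start from scratch whenever the box is re-provisioned (for example with vagrant provision). A possible refinement, sketched below, is to skip work that is already done. The function names are hypothetical, and the archive name and URL are simply the ones used earlier in this post.

```shell
# Archive name and URL are taken from the manual steps earlier in the post.
SPARK_TGZ="spark-1.1.0-bin-hadoop1.tgz"
SPARK_URL="http://d3kbcqa49mib13.cloudfront.net/$SPARK_TGZ"

# spark_dir (hypothetical helper): strip the .tgz suffix to get the
# directory name the archive unpacks into.
spark_dir() {
    echo "${1%.tgz}"
}

# fetch_spark (hypothetical helper): download and unpack Spark, skipping
# any step whose result already exists in the current directory.
fetch_spark() {
    [ -f "$SPARK_TGZ" ] || wget "$SPARK_URL"
    [ -d "$(spark_dir "$SPARK_TGZ")" ] || tar -zxf "$SPARK_TGZ"
}
```

To try it, one option is to save these functions in a script next to your Vagrantfile (say bootstrap.sh, a name chosen here purely for illustration), call fetch_spark on the last line of that script, and replace the wget and tar lines with config.vm.provision :shell, path: "bootstrap.sh". This assumes the script runs from the same directory the inline commands did. Note that a partially downloaded archive would also be skipped, so delete it by hand if a download gets interrupted.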
Test Drive
We're again back in your Vagrant project's root directory. Clean out the old vm.
$ vagrant destroy
Let's also get rid of the box we manually downloaded to make sure Vagrant can download it automatically.
$ vagrant box remove precise32
Now fire up the vm from our new and shiny Vagrantfile. This is going to take some time to re-download the box and Spark.
$ vagrant up
Now ssh into the vm like before and make sure Spark still works. Congratz! You have Spark running in a virtual environment that you can do your worst to with no real repercussions.
Next steps
There is plenty of great documentation as well as very useful comments that came included in the original Vagrantfile. I'll be using this to practice configuring systems. What will you use it for?