We assume you have a working Docker installation on your machine.
Get the image with docker pull datalayer/spitfire.
Git clone https://github.com/datalayer/datalayer-docker and start the Datalayer Spitfire server with the start.sh script located in the spitfire folder.
If you don’t want to clone the repository, simply run the following:
curl https://raw.githubusercontent.com/datalayer/datalayer-docker/master/spitfire/start.sh -o start.sh ; bash start.sh
You can now browse http://localhost:8666 to view the Notebook welcome page and analyse datasets.
Choose the Sign Up menu to create your profile. Once your profile is created, you can read the documentation to know more about the offered functionalities.
By default, Spark will use local[*] as the master URL. If you want to evaluate against a local Hadoop cluster in Docker, Datalayer also provides a combined, up-to-date Docker image with Datalayer Spitfire and Apache Ambari.
Spark in YARN mode
It is possible to connect from the Docker image to an external Hadoop cluster (or one that runs on the Docker host) in YARN mode.
The following sections detail the steps needed to connect to such an external Hadoop cluster.
HDFS File System
You will need a user folder for the user running the Zeppelin process in the Docker image.
For now, the root user is used by the Docker image. Type the following commands (or equivalent) to create the needed folders in the Hadoop HDFS cluster.
sudo -u hdfs hdfs dfs -mkdir /user/root
sudo -u hdfs hdfs dfs -chown -R root:hdfs /user/root
sudo -u hdfs hdfs dfs -ls /user/root
If you run with a cluster deployed via Ambari, perform the following configurations.
First, ensure the Spark interpreter is configured with spark.hadoop.yarn.timeline-service.enabled=false (currently, Spark 2.0.0 has issues with the Timeline service…).
Then set the hdp.version to avoid any bad substitution exception when running the process on the Hadoop cluster nodes.
For this, first get the hdp.version value with hdp-select status hadoop-client | sed 's/hadoop-client - //' (the output is a version string followed by a build number).
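The sed filter above simply strips the "hadoop-client - " prefix from the hdp-select output. Since hdp-select is only available on Hadoop cluster nodes, the sketch below simulates its output with a hypothetical version string:

```shell
# Simulated hdp-select output (the version string here is hypothetical;
# on a real cluster, run `hdp-select status hadoop-client` instead):
sample_output="hadoop-client - 2.5.0.0-1245"

# Strip the "hadoop-client - " prefix to keep only the version:
HDP_VERSION=$(echo "$sample_output" | sed 's/hadoop-client - //')
echo "$HDP_VERSION"   # prints 2.5.0.0-1245
```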
Then define the following Spark interpreter settings in the Zeppelin UI:
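A common way to propagate hdp.version to both the Spark driver and the YARN application master is via the extra Java options. The two property names below are standard Spark configuration keys; replace the placeholder with the value obtained from hdp-select:

```properties
spark.driver.extraJavaOptions    -Dhdp.version=<your hdp.version>
spark.yarn.am.extraJavaOptions   -Dhdp.version=<your hdp.version>
```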
You will start the Docker process with the following configuration:
- DOCKER_HADOOP_CONF_DIR is the folder where the Hadoop configuration files (core-site.xml…) are located.
- DOCKER_SPARK_MASTER is yarn.
The command to use is for example:
DOCKER_HADOOP_CONF_DIR=/etc/hadoop/conf DOCKER_SPARK_MASTER=yarn ./start.sh
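How start.sh consumes these variables is defined by the script itself in the datalayer-docker repository. A plausible sketch, assuming it mounts the configuration directory into the container and forwards the master URL as an environment variable, is:

```shell
# Hypothetical sketch of what a script like start.sh may do with the two
# variables (not the actual script; the SPARK_MASTER variable name inside
# the container is an assumption):
start_spitfire() {
  docker run -d \
    -p 8666:8666 \
    -v "${DOCKER_HADOOP_CONF_DIR:-/etc/hadoop/conf}":/etc/hadoop/conf \
    -e SPARK_MASTER="${DOCKER_SPARK_MASTER:-local[*]}" \
    datalayer/spitfire
}
```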
The hostnames used in the Hadoop configuration must resolve correctly from inside the Docker container.
This can be achieved via standard DNS. However, if you are using a development environment where all servers reside on your laptop, additional configuration is needed.
Let’s suppose the hostname used in the Hadoop configuration files is datalayer-laptop. First, note the IP address of the docker0 interface on your host by typing ifconfig:
docker0   Link encap:Ethernet  HWaddr 02:42:fe:d1:76:25
          inet addr:172.17.0.1  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:feff:fed1:7625/64 Scope:Link
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
In this case, the assigned IP address is 172.17.0.1.
Add an entry in the /etc/hosts file of your host, and the exact same line in the /etc/hosts file of the Docker container running on your host:
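Continuing the example above, with docker0 at 172.17.0.1 and the hostname datalayer-laptop, the entry would look like this (adapt the address and name to your own setup):

```
172.17.0.1    datalayer-laptop
```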
This ensures that RPC requests are directed to the correct IP address (the Hadoop servers tend to bind to a specific IP address and do not respond on other addresses).
Kill the Zeppelin process (if any), and restart it with ./bin/zeppelin-start.sh.
You should now be able to run Spark jobs in YARN mode on a Hadoop cluster external to the Docker image.