Docker Image

Datalayer provides an up-to-date Docker image for Apache Zeppelin, the web notebook for Big Data Science.

Get the image with docker pull datalayer/zeppelin.

Start the Zeppelin notebook with docker run -it -p 2222:22 -p 8066:8066 -p 4040:4040 datalayer/zeppelin and browse to http://localhost:8066.

The image is built with the Spark 2.0.0 interpreter (with Scala 2.11.8) in local mode.

Configuration

Configure with environment variables:

  • DOCKER_BACKGROUND_PROCESS = Run as background process (default is true).
  • DOCKER_SPARK_MASTER = MASTER for Spark (default is local[*]).
  • DOCKER_NOTEBOOK_DIR = Folder where the notes reside (default is /notebook).
  • DOCKER_WEB_PORT = The HTTP port (default is 8066).
  • DOCKER_HADOOP_CONF_DIR = The folder for the Hadoop configuration files on the host (default is /etc/hadoop/conf).

Example: DOCKER_WEB_PORT=8667 ./start.sh
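
As a rough illustration of how these defaults behave, the sketch below resolves each variable with the shell's `${VAR:-default}` expansion. It is not the actual content of start.sh; the function name zeppelin_settings is hypothetical, and only the variable names and default values come from the list above.

```shell
#!/bin/sh
# Hypothetical helper (not taken from start.sh): print the effective value of
# each documented variable, falling back to the documented default when the
# variable is unset or empty.
zeppelin_settings() {
  echo "DOCKER_BACKGROUND_PROCESS=${DOCKER_BACKGROUND_PROCESS:-true}"
  echo "DOCKER_SPARK_MASTER=${DOCKER_SPARK_MASTER:-local[*]}"
  echo "DOCKER_NOTEBOOK_DIR=${DOCKER_NOTEBOOK_DIR:-/notebook}"
  echo "DOCKER_WEB_PORT=${DOCKER_WEB_PORT:-8066}"
  echo "DOCKER_HADOOP_CONF_DIR=${DOCKER_HADOOP_CONF_DIR:-/etc/hadoop/conf}"
}

# Show the settings as they would resolve in the current environment.
zeppelin_settings

# Override a single value for one call only (the subshell keeps it local).
(DOCKER_WEB_PORT=8667; zeppelin_settings)
```

Only the variables you set explicitly change; everything else keeps its default, which is why the example above needs to pass DOCKER_WEB_PORT alone.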

Spark in YARN mode

It is possible to connect from the Docker image to an external Hadoop cluster (or one that runs on the Docker host) in yarn mode.

The following sections detail the steps needed to connect to such an external Hadoop cluster.

HDFS File System

You will need a user folder for the user running the Zeppelin process in the Docker image.

For now, the root user is used by the Docker image. Type the following commands (or equivalent) to create the needed folders in the Hadoop HDFS cluster.

sudo -u hdfs hdfs dfs -mkdir /user/root
sudo -u hdfs hdfs dfs -chown -R root:hdfs /user/root
sudo -u hdfs hdfs dfs -ls /user/root

Interpreter Configuration

If you run with a cluster deployed via Ambari, perform the following configurations.

First, ensure the Spark Interpreter is configured with spark.hadoop.yarn.timeline-service.enabled=false (currently, Spark 2.0.0 has issues with the Timeline service…).

Then set the hdp.version to avoid any bad substitution exception when running the process on the Hadoop cluster nodes.

For this, first get the hdp.version value with hdp-select status hadoop-client | sed 's/hadoop-client - //' (example: 2.4.2.0-258).

Then define the following Spark interpreter settings in the Zeppelin UI:

  • hdp.version="2.4.2.0-258"
  • spark.driver.extraJavaOptions="-Dhdp.version=2.4.2.0-258"
  • spark.yarn.am.extraJavaOptions="-Dhdp.version=2.4.2.0-258"
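
To avoid copying the version by hand, the three settings can be derived from the hdp-select output. The following is a minimal sketch, assuming the status line has the form "hadoop-client - <version>" as the sed command above implies; the function names are hypothetical, and a sample line stands in for the real hdp-select call so the snippet runs anywhere.

```shell
#!/bin/sh
# Strip the "hadoop-client - " prefix from the status line read on stdin,
# using the same sed expression as in the text above.
hdp_version_from_status() {
  sed 's/hadoop-client - //'
}

# Print the three Spark interpreter settings for the given version string,
# quoted the same way as in the list above.
hdp_interpreter_settings() {
  version="$1"
  echo "hdp.version=\"${version}\""
  echo "spark.driver.extraJavaOptions=\"-Dhdp.version=${version}\""
  echo "spark.yarn.am.extraJavaOptions=\"-Dhdp.version=${version}\""
}

# On a cluster node, replace the echo with: hdp-select status hadoop-client
version="$(echo 'hadoop-client - 2.4.2.0-258' | hdp_version_from_status)"
hdp_interpreter_settings "$version"
```

The printed lines still have to be entered in the Zeppelin UI; the sketch only keeps the version string consistent across the three settings.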

Hadoop Configuration

You will start the Docker process with the following configuration:

  • DOCKER_HADOOP_CONF_DIR is the folder where the Hadoop configuration files (core-site.xml…) are located.
  • DOCKER_SPARK_MASTER is yarn.

For example, use the following command:

DOCKER_HADOOP_CONF_DIR=/etc/hadoop/conf DOCKER_SPARK_MASTER=yarn ./start.sh

Host Resolution

From the Docker container, the hostname used in the Hadoop configuration must be correctly resolved.

This can be achieved via the standard DNS. However, if you are using a Development environment where all servers reside on your laptop, additional configuration is needed.

Let’s suppose the hostname used in the Hadoop configuration files is datalayer-laptop. First, note the address of the docker0 interface on your host by running ifconfig:

docker0   Link encap:Ethernet  HWaddr 02:42:fe:d1:76:25  
          inet addr:172.17.0.1  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:feff:fed1:7625/64 Scope:Link
          UP BROADCAST MULTICAST  MTU:1500  Metric:1

In this case, the assigned IP address is 172.17.0.1.

Add an entry to the /etc/hosts file of your host, and the exact same line to the /etc/hosts file of the Docker container running on your host:

172.17.0.1 datalayer-laptop

This ensures that the RPC requests are directed to the correct IP address (the Hadoop servers tend to bind to a specific IP address and do not respond on other IP addresses).
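
The lookup and the hosts line can also be produced with a few lines of shell. This is a sketch under two assumptions: ifconfig prints the older "inet addr:" format shown above (newer versions print "inet <address>" instead), and the helper names inet_addr and hosts_entry are hypothetical. An /etc/hosts line lists the IP address first, followed by the hostname.

```shell
#!/bin/sh
# Extract the first IPv4 address from ifconfig-style text read on stdin.
# Assumes the "inet addr:<ip>" format; "inet6 addr:" lines do not match.
inet_addr() {
  awk '/inet addr:/ { sub(/^addr:/, "", $2); print $2; exit }'
}

# Format an /etc/hosts line: IP address first, then the hostname.
hosts_entry() {
  printf '%s %s\n' "$1" "$2"
}

# Sample text copied from the ifconfig output above, so the sketch runs
# without a docker0 interface. On a real host: ifconfig docker0 | inet_addr
sample='docker0   Link encap:Ethernet  HWaddr 02:42:fe:d1:76:25
          inet addr:172.17.0.1  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:feff:fed1:7625/64 Scope:Link'

ip="$(printf '%s\n' "$sample" | inet_addr)"
hosts_entry "$ip" datalayer-laptop
```

The sketch only prints the line; appending it to /etc/hosts on the host and in the container remains a manual step that requires root.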

Kill the Zeppelin process (if any), and restart with ./bin/zeppelin-start.sh

You should now be able to run Spark jobs in YARN mode on a Hadoop cluster external to the Docker image.

Build

Clone https://github.com/datalayer/datalayer-docker with git, cd into datalayer-docker/zeppelin, and build the Docker image with the ./build.sh script located in that folder.

License

Copyright 2016 Datalayer http://datalayer.io

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
