Zeppelin: Add Zeppelin image to Spark example

This adds a very basic Zeppelin image that works with the existing
Spark example. As can be seen from the documentation, it has a couple
of warts:

* It requires `kubectl port-forward` (which is unstable across long
periods of time, at least for me, on this app; bug incoming). See

* I needed to roll my own container (none of the existing containers
exactly matched my needs, or even build anymore against modern Zeppelin
master, and the rest of the example is on Spark 1.5).

The image itself is *huge*. One of the further refinements we need to
look at is how to possibly strip the Maven build for this container
down to just the interpreters we care about, because the deps here
are frankly ridiculous.

This might be a case where, if possible, we might want to open an
upstream request to build things dynamically, which would probably cut
the image down considerably. (This might already be possible; need to
poke at whether you can late-bind interpreters.)
Zach Loafman 2015-11-05 14:56:15 -08:00
parent 6a34b2cc04
commit 51817850ba
10 changed files with 239 additions and 29 deletions

View File

@@ -355,11 +355,12 @@ func TestExampleObjectSchemas(t *testing.T) {
"secret": &api.Secret{},
},
"../examples/spark": {
"spark-driver-controller": &api.ReplicationController{},
"spark-master-controller": &api.ReplicationController{},
"spark-master-service": &api.Service{},
"spark-webui": &api.Service{},
"spark-worker-controller": &api.ReplicationController{},
"zeppelin-controller": &api.ReplicationController{},
"zeppelin-service": &api.Service{},
},
"../examples/spark/spark-gluster": {
"spark-master-service": &api.Service{},

View File

@@ -120,8 +120,8 @@ Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/spark-1.5
15/10/27 21:25:07 INFO Master: I have been elected leader! New state: ALIVE
```
After you know the master is running, you can use the (cluster
proxy)[../../docs/user-guide/accessing-the-cluster.md#using-kubectl-proxy] to
After you know the master is running, you can use the [cluster
proxy](../../docs/user-guide/accessing-the-cluster.md#using-kubectl-proxy) to
connect to the Spark WebUI:
```console
@@ -129,7 +129,7 @@ kubectl proxy --port=8001
```
At which point the UI will be available at
http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/
[http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/](http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/).
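To sanity-check the proxy path, something like this should return the WebUI
HTML (a sketch; assumes the `kubectl proxy` from above is still running):
```console
$ curl http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/
```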
## Step Two: Start your Spark workers
@@ -172,32 +172,40 @@ you should now see the workers in the UI as well. *Note:* The UI will have links
to worker Web UIs. The worker UI links do not work (the links will attempt to
connect to cluster IPs, which Kubernetes won't proxy automatically).
## Step Three: Start your Spark driver to launch jobs on your Spark cluster
## Step Three: Start the Zeppelin UI to launch jobs on your Spark cluster
The Spark driver is used to launch jobs into Spark cluster. You can read more about it in
[Spark architecture](https://spark.apache.org/docs/latest/cluster-overview.html).
The Zeppelin UI pod can be used to launch jobs into the Spark cluster either via
a web notebook frontend or the traditional Spark command line. See
[Zeppelin](https://zeppelin.incubator.apache.org/) and
[Spark architecture](https://spark.apache.org/docs/latest/cluster-overview.html)
for more details.
```console
$ kubectl create -f examples/spark/spark-driver-controller.yaml
replicationcontrollers/spark-driver-controller
$ kubectl create -f examples/spark/zeppelin-controller.yaml
replicationcontrollers/zeppelin-controller
```
The Spark driver needs the Master service to be running.
Zeppelin needs the Master service to be running.
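If you want to double-check that, something like the following should show it
(a sketch; `spark-master` as the service name is an assumption, based on the
`spark://spark-master:7077` URL in `zeppelin-env.sh` below):
```console
$ # Assumes the Master service is named spark-master.
$ kubectl get service spark-master
```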
### Check to see if the driver is running
### Check to see if Zeppelin is running
```console
$ kubectl get pods -lcomponent=spark-driver
NAME READY STATUS RESTARTS AGE
spark-driver-controller-vwb9c 1/1 Running 0 1m
$ kubectl get pods -lcomponent=zeppelin
NAME READY STATUS RESTARTS AGE
zeppelin-controller-ja09s 1/1 Running 0 53s
```
## Step Four: Do something with the cluster
Use the kubectl exec to connect to Spark driver and run a pipeline.
Now you have two choices, depending on your predilections. You can do something
graphical with the Spark cluster, or you can stay in the CLI.
### Do something fast with pyspark!
Use `kubectl exec` to connect to the Zeppelin driver and run a pipeline.
```console
$ kubectl exec spark-driver-controller-vwb9c -it pyspark
$ kubectl exec zeppelin-controller-ja09s -it pyspark
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
@@ -217,6 +225,24 @@ SparkContext available as sc, HiveContext available as sqlContext.
Congratulations, you just counted all of the words in all of the plays of
Shakespeare.
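The pipeline itself falls outside this hunk, but it is presumably the same
word count used in the Zeppelin notebook below; at the pyspark prompt that
would be:
```console
>>> sc.textFile("gs://dataflow-samples/shakespeare/*").map(lambda s: len(s.split())).sum()
```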
### Do something graphical and shiny!
Take the Zeppelin pod from above and port-forward the WebUI port:
```console
$ kubectl port-forward zeppelin-controller-ja09s 8080:8080
```
This forwards `localhost` port 8080 to container port 8080. You can then find
Zeppelin at [http://localhost:8080/](http://localhost:8080/).
Create a "New Notebook". In there, type:
```
%pyspark
print sc.textFile("gs://dataflow-samples/shakespeare/*").map(lambda s: len(s.split())).sum()
```
## Result
You now have services and replication controllers for the Spark master, Spark
@@ -235,10 +261,46 @@ After it's setup:
```console
kubectl get pods # Make sure everything is running
kubectl proxy --port=8001 # Start an application proxy, if you want to see the Spark WebUI
kubectl get pods -lcomponent=spark-driver # Get the driver pod to interact with.
kubectl proxy --port=8001 # Start an application proxy, if you want to see the Spark Master WebUI
kubectl get pods -lcomponent=zeppelin # Get the driver pod to interact with.
```
At which point the Master UI will be available at
[http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/](http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/).
You can either interact with the Spark cluster using the traditional
`spark-shell` / `spark-submit` / `pyspark` commands via `kubectl exec` against
the `zeppelin-controller` pod, or, if you want to interact with Zeppelin:
```console
kubectl port-forward zeppelin-controller-abc123 8080:8080 &
```
Then visit [http://localhost:8080/](http://localhost:8080/).
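The `kubectl exec` route looks just like Step Four (the pod name here is a
placeholder; use the name from `kubectl get pods`):
```console
$ kubectl exec zeppelin-controller-abc123 -it pyspark
```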
## Known Issues With Spark
* This provides a Spark configuration that is restricted to the cluster network,
meaning the Spark master is only available as a cluster service. If you need
to submit jobs using an external client other than Zeppelin or `spark-submit` on
the `zeppelin` pod, you will need to provide a way for your clients to reach
the service defined in
[`examples/spark/spark-master-service.yaml`](spark-master-service.yaml) (one
hedged possibility is sketched after this list). See
[Services](../../docs/user-guide/services.md) for more information.
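One hedged possibility (an untested sketch, not part of this example): switch
the Master service to a NodePort, if your environment supports that:
```console
$ # Hypothetical; assumes the Master service is named spark-master.
$ kubectl patch service spark-master -p '{"spec": {"type": "NodePort"}}'
```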
## Known Issues With Zeppelin
* The Zeppelin pod is large, so it may take a while to pull depending on your
network. The size of the Zeppelin pod is something we're working on; see issue #17231.
* Zeppelin may take some time (about a minute) on this pipeline the first time
you run it; it seems to spend considerable time loading.
* On GKE, `kubectl port-forward` may not be stable over long periods of time. If
you see Zeppelin go into `Disconnected` state (there will be a red dot on the
top right as well), the `port-forward` probably failed and needs to be
restarted. See #12179.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/examples/spark/README.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@@ -1,12 +1,20 @@
all: push
push: push-spark push-zeppelin
.PHONY: push push-spark push-zeppelin spark zeppelin
# To bump the Spark version, bump the version in base/Dockerfile, bump
# this tag and reset to v1. You should also double check the native
# Hadoop libs at that point (we grab the 2.6.1 libs, which are
# appropriate for 1.5.1-with-2.6).
# the version in zeppelin/Dockerfile, bump this tag and reset to
# v1. You should also double check the native Hadoop libs at that
# point (we grab the 2.6.1 libs, which are appropriate for
# 1.5.1-with-2.6). Note that you'll need to re-test Zeppelin (and it
# may not have caught up to newest Spark).
TAG = 1.5.1_v2
containers:
# To bump the Zeppelin version, bump the version in
# zeppelin/Dockerfile and bump this tag and reset to v1.
ZEPPELIN_TAG = v0.5.5_v1
spark:
	docker build -t gcr.io/google_containers/spark-base base
	docker tag gcr.io/google_containers/spark-base gcr.io/google_containers/spark-base:$(TAG)
	docker build -t gcr.io/google_containers/spark-worker worker
@@ -16,7 +24,11 @@ containers:
	docker build -t gcr.io/google_containers/spark-driver driver
	docker tag gcr.io/google_containers/spark-driver gcr.io/google_containers/spark-driver:$(TAG)
push: containers
zeppelin:
	docker build -t gcr.io/google_containers/zeppelin zeppelin
	docker tag -f gcr.io/google_containers/zeppelin gcr.io/google_containers/zeppelin:$(ZEPPELIN_TAG)
push-spark: spark
	gcloud docker push gcr.io/google_containers/spark-base
	gcloud docker push gcr.io/google_containers/spark-base:$(TAG)
	gcloud docker push gcr.io/google_containers/spark-worker
@@ -26,4 +38,8 @@ push: containers
	gcloud docker push gcr.io/google_containers/spark-driver
	gcloud docker push gcr.io/google_containers/spark-driver:$(TAG)
push-zeppelin: zeppelin
	gcloud docker push gcr.io/google_containers/zeppelin
	gcloud docker push gcr.io/google_containers/zeppelin:$(ZEPPELIN_TAG)
clean:
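With these targets, building and pushing only the Zeppelin image is
presumably just (a sketch; assumes you run make from the directory containing
this Makefile):
```console
$ make zeppelin        # build and tag the image locally
$ make push-zeppelin   # build, then push via gcloud docker
```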

View File

@@ -1,4 +1,4 @@
FROM java:latest
FROM java:openjdk-8-jdk
ENV hadoop_ver 2.6.1
ENV spark_ver 1.5.1

View File

@@ -0,0 +1,66 @@
# Copyright 2015 The Kubernetes Authors All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Based heavily on
# https://github.com/dylanmei/docker-zeppelin/blob/master/Dockerfile
# (which is similar to many others out there), but rebased onto maven
# image.
#
# This image is a composition of the official docker-maven
# Docker image from https://github.com/carlossg/docker-maven/ and
# spark-base.
FROM gcr.io/google_containers/spark-base:latest
ENV ZEPPELIN_TAG v0.5.5
ENV MAVEN_VERSION 3.3.3
ENV SPARK_MINOR 1.5
ENV SPARK_PATCH 1
ENV SPARK_VER ${SPARK_MINOR}.${SPARK_PATCH}
ENV HADOOP_MINOR 2.6
ENV HADOOP_PATCH 1
ENV HADOOP_VER ${HADOOP_MINOR}.${HADOOP_PATCH}
RUN curl -fsSL http://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz | tar xzf - -C /usr/share \
&& mv /usr/share/apache-maven-${MAVEN_VERSION} /usr/share/maven \
&& ln -s /usr/share/maven/bin/mvn /usr/bin/mvn
ENV MAVEN_HOME /usr/share/maven
# libfontconfig is a workaround for
# https://github.com/karma-runner/karma/issues/1270, which caused a
# build break similar to
# https://www.mail-archive.com/users@zeppelin.incubator.apache.org/msg01586.html
RUN apt-get update \
&& apt-get install -y net-tools build-essential git wget unzip python python-setuptools python-dev python-numpy libfontconfig \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
RUN git clone https://github.com/apache/incubator-zeppelin.git --branch ${ZEPPELIN_TAG} /opt/zeppelin
RUN cd /opt/zeppelin && \
mvn clean package \
-Pspark-${SPARK_MINOR} -Dspark.version=${SPARK_VER} \
-Phadoop-${HADOOP_MINOR} -Dhadoop.version=${HADOOP_VER} \
-Ppyspark \
-DskipTests && \
rm -rf /root/.m2 && \
rm -rf /root/.npm && \
echo "Successfully built Zeppelin"
ADD zeppelin-log4j.properties /opt/zeppelin/conf/log4j.properties
ADD zeppelin-env.sh /opt/zeppelin/conf/zeppelin-env.sh
ADD docker-zeppelin.sh /opt/zeppelin/bin/docker-zeppelin.sh
EXPOSE 8080
ENTRYPOINT ["/opt/zeppelin/bin/docker-zeppelin.sh"]
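For a local smoke test of the image (a sketch; the build-context path and tag
are assumptions, and the Spark interpreter won't connect without a resolvable
`spark-master`):
```console
$ docker build -t zeppelin-test examples/spark/images/zeppelin
$ docker run --rm -p 8080:8080 zeppelin-test
```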

View File

@@ -0,0 +1,21 @@
#!/bin/bash
# Copyright 2015 The Kubernetes Authors All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
export ZEPPELIN_HOME=/opt/zeppelin
export ZEPPELIN_CONF_DIR="${ZEPPELIN_HOME}/conf"
echo "=== Launching Zeppelin under Docker ==="
/opt/zeppelin/bin/zeppelin.sh "${ZEPPELIN_CONF_DIR}"

View File

@@ -0,0 +1,26 @@
#!/bin/bash
# Copyright 2015 The Kubernetes Authors All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
export MASTER="spark://spark-master:7077"
export SPARK_HOME=/opt/spark
export ZEPPELIN_JAVA_OPTS="-Dspark.jars=/opt/spark/lib/gcs-connector-latest-hadoop2.jar"
# TODO(zmerlynn): Setting global CLASSPATH *should* be unnecessary,
# but ZEPPELIN_JAVA_OPTS isn't enough here. :(
export CLASSPATH="/opt/spark/lib/gcs-connector-latest-hadoop2.jar"
export ZEPPELIN_NOTEBOOK_DIR="${ZEPPELIN_HOME}/notebook"
export ZEPPELIN_MEM=-Xmx1024m
export ZEPPELIN_PORT=8080
export PYTHONPATH="${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip"

View File

@@ -0,0 +1,6 @@
# Set everything to be logged to the console.
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%5p [%d] ({%t} %F[%M]:%L) - %m%n

View File

@@ -1,19 +1,21 @@
kind: ReplicationController
apiVersion: v1
metadata:
  name: spark-driver-controller
  name: zeppelin-controller
spec:
  replicas: 1
  selector:
    component: spark-driver
    component: zeppelin
  template:
    metadata:
      labels:
        component: spark-driver
        component: zeppelin
    spec:
      containers:
      - name: spark-driver
        image: gcr.io/google_containers/spark-driver:1.5.1_v2
      - name: zeppelin
        image: gcr.io/google_containers/zeppelin:v0.5.5_v1
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 100m

View File

@@ -0,0 +1,10 @@
kind: Service
apiVersion: v1
metadata:
  name: zeppelin
spec:
  ports:
  - port: 8080
    targetPort: 8080
  selector:
    component: zeppelin
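Creating and checking the service would presumably look like this (the README
above reaches Zeppelin via `kubectl port-forward` instead, so this is
optional):
```console
$ kubectl create -f examples/spark/zeppelin-service.yaml
$ kubectl get service zeppelin
```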