# Spark example
Following this example, you will create a functional [Apache
Spark](http://spark.apache.org/) cluster using Kubernetes and
[Docker](http://docker.io).

You will set up a Spark master service and a set of Spark workers using Spark's [standalone mode](http://spark.apache.org/docs/latest/spark-standalone.html).

For the impatient expert, jump straight to the [tl;dr](#tldr)
section.

### Sources

The Docker images are heavily based on https://github.com/mattf/docker-spark
and are curated in https://github.com/kubernetes/application-images/tree/master/spark.

The Spark UI Proxy is taken from https://github.com/aseigneurin/spark-ui-proxy.

The PySpark examples are taken from http://stackoverflow.com/questions/4114167/checking-if-a-number-is-a-prime-number-in-python/27946768#27946768

## Step Zero: Prerequisites

This example assumes:

- You have a Kubernetes cluster installed and running.
- You have the `kubectl` command line tool installed in your path and configured to talk to your Kubernetes cluster.
- Your Kubernetes cluster is running [kube-dns](../../build-tools/kube-dns/) or an equivalent integration.

Optionally, your Kubernetes cluster should be configured with a Loadbalancer integration (automatically configured via kube-up or GKE).

## Step One: Create namespace
```sh
$ kubectl create -f examples/spark/namespace-spark-cluster.yaml
```
Now list all namespaces:
```sh
$ kubectl get namespaces
NAME            LABELS               STATUS
default         <none>               Active
spark-cluster   name=spark-cluster   Active
```

To configure kubectl to work with our namespace, we will create a new context using our current context as a base:

```sh
$ CURRENT_CONTEXT=$(kubectl config view -o jsonpath='{.current-context}')
$ USER_NAME=$(kubectl config view -o jsonpath='{.contexts[?(@.name == "'"${CURRENT_CONTEXT}"'")].context.user}')
$ CLUSTER_NAME=$(kubectl config view -o jsonpath='{.contexts[?(@.name == "'"${CURRENT_CONTEXT}"'")].context.cluster}')
$ kubectl config set-context spark --namespace=spark-cluster --cluster=${CLUSTER_NAME} --user=${USER_NAME}
$ kubectl config use-context spark
```
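
The three `kubectl config view` lookups above can also be done in one pass over the JSON form of the kubeconfig. The following is a minimal Python 3 sketch, not part of the example itself. The helper name `context_user_and_cluster` is hypothetical, and the sample names are made up. It assumes the layout that the jsonpath expressions above rely on: a top-level `current-context` string and a `contexts` list whose entries carry `name`, `context.user` and `context.cluster`.

```python
import json


def context_user_and_cluster(kubeconfig):
    """Return (user, cluster) for the current context of a parsed kubeconfig.

    `kubeconfig` is the dict parsed from `kubectl config view -o json`.
    """
    current = kubeconfig["current-context"]
    for entry in kubeconfig.get("contexts") or []:
        if entry["name"] == current:
            ctx = entry["context"]
            return ctx["user"], ctx["cluster"]
    raise KeyError("current context %r not found in contexts" % current)


# Illustrative input only; these context, user and cluster names are made up.
sample = json.loads("""
{
  "current-context": "dev",
  "contexts": [
    {"name": "dev", "context": {"user": "dev-user", "cluster": "dev-cluster"}},
    {"name": "prod", "context": {"user": "prod-user", "cluster": "prod-cluster"}}
  ]
}
""")

user, cluster = context_user_and_cluster(sample)
print(user, cluster)
```

The two values returned here correspond to `USER_NAME` and `CLUSTER_NAME` in the shell commands above.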
## Step Two: Start your Master service

The Master [service](../../docs/user-guide/services.md) is the master service
for a Spark cluster.

Use the
[`examples/spark/spark-master-controller.yaml`](spark-master-controller.yaml)
file to create a
[replication controller](../../docs/user-guide/replication-controller.md)
running the Spark Master service.

```console
$ kubectl create -f examples/spark/spark-master-controller.yaml
replicationcontroller "spark-master-controller" created
```

Then, use the
[`examples/spark/spark-master-service.yaml`](spark-master-service.yaml) file to
create a logical service endpoint that Spark workers can use to access the
Master pod:

```console
$ kubectl create -f examples/spark/spark-master-service.yaml
service "spark-master" created
```
### Check to see if Master is running and accessible
```console
$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
spark-master-controller-5u0q5   1/1       Running   0          8m
```

Check logs to see the status of the master. (Use the pod retrieved from the previous output.)

```sh
$ kubectl logs spark-master-controller-5u0q5
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-1.5.1-bin-hadoop2.6/sbin/../logs/spark--org.apache.spark.deploy.master.Master-1-spark-master-controller-g0oao.out
Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/spark-1.5.1-bin-hadoop2.6/sbin/../conf/:/opt/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip spark-master --port 7077 --webui-port 8080
========================================
15/10/27 21:25:05 INFO Master: Registered signal handlers for [TERM, HUP, INT]
15/10/27 21:25:05 INFO SecurityManager: Changing view acls to: root
15/10/27 21:25:05 INFO SecurityManager: Changing modify acls to: root
15/10/27 21:25:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/10/27 21:25:06 INFO Slf4jLogger: Slf4jLogger started
15/10/27 21:25:06 INFO Remoting: Starting remoting
15/10/27 21:25:06 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@spark-master:7077]
15/10/27 21:25:06 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
15/10/27 21:25:07 INFO Master: Starting Spark master at spark://spark-master:7077
15/10/27 21:25:07 INFO Master: Running Spark version 1.5.1
15/10/27 21:25:07 INFO Utils: Successfully started service 'MasterUI' on port 8080.
15/10/27 21:25:07 INFO MasterWebUI: Started MasterWebUI at http://spark-master:8080
15/10/27 21:25:07 INFO Utils: Successfully started service on port 6066.
15/10/27 21:25:07 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
15/10/27 21:25:07 INFO Master: I have been elected leader! New state: ALIVE
```
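
If you would rather not read the log by eye, the two lines that matter (the master URL and the leader election) can be picked out with a short script. This is a rough Python 3 sketch that assumes the log wording shown above, which may differ in other Spark versions. The function name `master_status` is made up for illustration.

```python
import re

# Patterns copied from the wording of the Spark 1.5.1 master log shown above.
MASTER_URL_RE = re.compile(r"INFO Master: Starting Spark master at (spark://\S+)")
LEADER_RE = re.compile(r"INFO Master: I have been elected leader! New state: (\w+)")


def master_status(log_text):
    """Return (master_url, state); either part is None if its line is absent."""
    url = MASTER_URL_RE.search(log_text)
    state = LEADER_RE.search(log_text)
    return (url.group(1) if url else None, state.group(1) if state else None)


# Three lines taken from the log output above.
log_excerpt = """\
15/10/27 21:25:07 INFO Master: Starting Spark master at spark://spark-master:7077
15/10/27 21:25:07 INFO Master: Running Spark version 1.5.1
15/10/27 21:25:07 INFO Master: I have been elected leader! New state: ALIVE
"""

print(master_status(log_excerpt))
```

In practice the text would come from `kubectl logs <master-pod>`; a state of `ALIVE` together with a `spark://` URL suggests the master is ready for workers.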

Once the master is started, we'll want to check the Spark WebUI. In order to access the Spark WebUI, we will deploy a [specialized proxy](https://github.com/aseigneurin/spark-ui-proxy). This proxy is necessary to access worker logs from the Spark UI.

Deploy the proxy controller with [`examples/spark/spark-ui-proxy-controller.yaml`](spark-ui-proxy-controller.yaml):

```console
$ kubectl create -f examples/spark/spark-ui-proxy-controller.yaml
replicationcontroller "spark-ui-proxy-controller" created
```
We'll also need a corresponding load-balanced service for our Spark proxy, defined in [`examples/spark/spark-ui-proxy-service.yaml`](spark-ui-proxy-service.yaml):
```console
$ kubectl create -f examples/spark/spark-ui-proxy-service.yaml
service "spark-ui-proxy" created
```
After creating the service, you should eventually get a loadbalanced endpoint:
```console
$ kubectl get svc spark-ui-proxy -o wide
NAME             CLUSTER-IP    EXTERNAL-IP                                                              PORT(S)   AGE       SELECTOR
spark-ui-proxy   10.0.51.107   aad59283284d611e6839606c214502b5-833417581.us-east-1.elb.amazonaws.com   80/TCP    9m        component=spark-ui-proxy
```

The Spark UI in the above example output will be available at http://aad59283284d611e6839606c214502b5-833417581.us-east-1.elb.amazonaws.com

If your Kubernetes cluster is not equipped with a Loadbalancer integration, you will need to use the [kubectl proxy](../../docs/user-guide/accessing-the-cluster.md#using-kubectl-proxy) to
connect to the Spark WebUI:

```console
kubectl proxy --port=8001
```

At which point the UI will be available at
[http://localhost:8001/api/v1/proxy/namespaces/spark-cluster/services/spark-master:8080/](http://localhost:8001/api/v1/proxy/namespaces/spark-cluster/services/spark-master:8080/).

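
The proxy URL above follows a fixed pattern: local proxy port, namespace, then the service name with an optional port. The small Python 3 helper below is only a sketch of that pattern as used in this document. The function name `service_proxy_url` is hypothetical, and the path layout is taken from the URLs in this example rather than checked against other Kubernetes versions.

```python
def service_proxy_url(namespace, service, service_port=None, proxy_port=8001):
    """Build the `kubectl proxy` URL for a service, using this document's path layout."""
    # A service port, when given, is appended to the service name as "name:port".
    target = service if service_port is None else "%s:%s" % (service, service_port)
    return "http://localhost:%d/api/v1/proxy/namespaces/%s/services/%s/" % (
        proxy_port,
        namespace,
        target,
    )


print(service_proxy_url("spark-cluster", "spark-master", 8080))
print(service_proxy_url("spark-cluster", "spark-ui-proxy"))
```

The first call reproduces the Spark master URL given above.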
## Step Three: Start your Spark workers
The Spark workers do the heavy lifting in a Spark cluster. They
provide execution resources and data cache capabilities for your
program.
The Spark workers need the Master service to be running.

Use the [`examples/spark/spark-worker-controller.yaml`](spark-worker-controller.yaml) file to create a
[replication controller](../../docs/user-guide/replication-controller.md) that manages the worker pods.

```console
$ kubectl create -f examples/spark/spark-worker-controller.yaml
replicationcontroller "spark-worker-controller" created
```
### Check to see if the workers are running
If you launched the Spark WebUI, your workers should just appear in the UI when
they're ready. (It may take a little bit to pull the images and launch the
pods.) You can also interrogate the status in the following way:
```console
$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
spark-master-controller-5u0q5   1/1       Running   0          25m
spark-worker-controller-e8otp   1/1       Running   0          6m
spark-worker-controller-fiivl   1/1       Running   0          6m
spark-worker-controller-ytc7o   1/1       Running   0          6m

$ kubectl logs spark-master-controller-5u0q5
[...]
15/10/26 18:20:14 INFO Master: Registering worker 10.244.1.13:53567 with 2 cores, 6.3 GB RAM
15/10/26 18:20:14 INFO Master: Registering worker 10.244.2.7:46195 with 2 cores, 6.3 GB RAM
15/10/26 18:20:14 INFO Master: Registering worker 10.244.3.8:39926 with 2 cores, 6.3 GB RAM
```
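
The `Registering worker` lines can be tallied to confirm that every worker joined and to see how many cores the cluster offers. The following Python 3 sketch assumes the exact log wording shown above, including the `GB` unit, so other Spark versions or memory sizes may need a different pattern. The name `registered_workers` is made up for illustration.

```python
import re

# Matches lines such as:
#   INFO Master: Registering worker 10.244.1.13:53567 with 2 cores, 6.3 GB RAM
WORKER_RE = re.compile(
    r"Registering worker (\S+):(\d+) with (\d+) cores, ([\d.]+) GB RAM"
)


def registered_workers(log_text):
    """Return a list of (host, cores, ram_gb) tuples, one per registration line."""
    return [
        (host, int(cores), float(ram))
        for host, _port, cores, ram in WORKER_RE.findall(log_text)
    ]


# The three registration lines from the master log above.
log_excerpt = """\
15/10/26 18:20:14 INFO Master: Registering worker 10.244.1.13:53567 with 2 cores, 6.3 GB RAM
15/10/26 18:20:14 INFO Master: Registering worker 10.244.2.7:46195 with 2 cores, 6.3 GB RAM
15/10/26 18:20:14 INFO Master: Registering worker 10.244.3.8:39926 with 2 cores, 6.3 GB RAM
"""

workers = registered_workers(log_excerpt)
total_cores = sum(cores for _host, cores, _ram in workers)
print(len(workers), "workers,", total_cores, "cores")
```

For the excerpt above this reports three workers and six cores, matching the three worker pods listed by `kubectl get pods`.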
## Step Four: Start the Zeppelin UI to launch jobs on your Spark cluster
The Zeppelin UI pod can be used to launch jobs into the Spark cluster either via
a web notebook frontend or the traditional Spark command line. See
[Zeppelin](https://zeppelin.incubator.apache.org/) and
[Spark architecture](https://spark.apache.org/docs/latest/cluster-overview.html)
for more details.

Deploy Zeppelin:

```console
$ kubectl create -f examples/spark/zeppelin-controller.yaml
replicationcontroller "zeppelin-controller" created
```
And the corresponding service:
```console
$ kubectl create -f examples/spark/zeppelin-service.yaml
service "zeppelin" created
```
Zeppelin needs the spark-master service to be running.
### Check to see if Zeppelin is running
```console
$ kubectl get pods -l component=zeppelin
NAME                        READY     STATUS    RESTARTS   AGE
zeppelin-controller-ja09s   1/1       Running   0          53s
```
## Step Five: Do something with the cluster
Now you have two choices, depending on your predilections. You can do something
graphical with the Spark cluster, or you can stay in the CLI.

For both choices, we will be working with this Python snippet:
```python
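from math import sqrt
from itertools import count, islice

def isprime(n):
    # Trial division by every integer from 2 up to floor(sqrt(n)).
    return n > 1 and all(n % i for i in islice(count(2), int(sqrt(n) - 1)))

# Python 2 syntax (xrange, print statement), matching the Python 2.7.9 that pyspark reports.
nums = sc.parallelize(xrange(10000000))
print nums.filter(isprime).count()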
```
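
Before sending the job to the cluster, it can be reassuring to check the `isprime` predicate locally on a small range. The sketch below is a plain Python 3 variant that needs neither Spark nor the `sc` context. The helper `count_primes` is an addition for illustration and is not part of the example.

```python
from itertools import count, islice
from math import sqrt


def isprime(n):
    # Same predicate as the snippet above: trial division by 2..floor(sqrt(n)).
    return n > 1 and all(n % i for i in islice(count(2), int(sqrt(n) - 1)))


def count_primes(limit):
    """Count the primes in range(limit) on the local machine, without Spark."""
    return sum(1 for n in range(limit) if isprime(n))


print(count_primes(100))  # 25 primes below 100
```

The Spark version distributes exactly this filter over the workers; only the range is larger.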
### Do something fast with pyspark!

Simply copy and paste the Python snippet into `pyspark` from within the Zeppelin pod:

```console
$ kubectl exec zeppelin-controller-ja09s -it pyspark
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Python version 2.7.9 (default, Mar 1 2015 12:57:24)
SparkContext available as sc, HiveContext available as sqlContext.
>>> from math import sqrt; from itertools import count, islice
>>>
>>> def isprime(n):
...     return n > 1 and all(n%i for i in islice(count(2), int(sqrt(n)-1)))
...
>>> nums = sc.parallelize(xrange(10000000))
>>> print nums.filter(isprime).count()
664579
```
Congratulations, you now know how many prime numbers there are within the first 10 million numbers!
### Do something graphical and shiny!
Creating the Zeppelin service should have yielded you a Loadbalancer endpoint:
```console
$ kubectl get svc zeppelin -o wide
NAME       CLUSTER-IP   EXTERNAL-IP                                                              PORT(S)   AGE       SELECTOR
zeppelin   10.0.154.1   a596f143884da11e6839506c114532b5-121893930.us-east-1.elb.amazonaws.com   80/TCP    3m        component=zeppelin
```
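
The `EXTERNAL-IP` column can also be read programmatically from the JSON form of the service (`kubectl get svc zeppelin -o json`), where load balancer addresses appear under `status.loadBalancer.ingress`. The Python 3 sketch below shows one way to pull them out. The function name `external_endpoints` is hypothetical, and the sample object is trimmed to the relevant fields, reusing the hostname from the output above.

```python
def external_endpoints(service):
    """List load balancer hostnames or IPs from a parsed Service object.

    `service` is the dict parsed from `kubectl get svc <name> -o json`.
    The list is empty while no load balancer address has been assigned.
    """
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    endpoints = []
    for item in ingress:
        # Cloud providers report either a hostname (as on AWS above) or an IP.
        address = item.get("hostname") or item.get("ip")
        if address:
            endpoints.append(address)
    return endpoints


# Trimmed-down Service object for illustration; real output has many more fields.
sample_service = {
    "metadata": {"name": "zeppelin"},
    "status": {
        "loadBalancer": {
            "ingress": [
                {
                    "hostname": "a596f143884da11e6839506c114532b5-121893930.us-east-1.elb.amazonaws.com"
                }
            ]
        }
    },
}

print(external_endpoints(sample_service))
```

An empty result usually means the endpoint is still being provisioned or that the cluster has no Loadbalancer integration.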
If your Kubernetes cluster does not have a Loadbalancer integration, then we will have to use port forwarding.
Take the Zeppelin pod from before and port-forward the WebUI port:
```console
$ kubectl port-forward zeppelin-controller-ja09s 8080:8080
```
This forwards `localhost` 8080 to container port 8080. You can then find
Zeppelin at [http://localhost:8080/](http://localhost:8080/).

Once you've loaded up the Zeppelin UI, create a "New Notebook". In there we will paste our Python snippet, but we need to add a `%pyspark` hint for Zeppelin to understand it:

```
%pyspark
from math import sqrt; from itertools import count, islice

def isprime(n):
    return n > 1 and all(n%i for i in islice(count(2), int(sqrt(n)-1)))

nums = sc.parallelize(xrange(10000000))
print nums.filter(isprime).count()
```
After pasting in our code, press shift+enter or click the play icon to the right of our snippet. The Spark job will run and once again we'll have our result!
## Result

You now have services and replication controllers for the Spark master, Spark
workers and Spark driver. You can take this example to the next step and start
using the Apache Spark cluster you just created. See the
[Spark documentation](https://spark.apache.org/documentation.html) for more
information.

## tl;dr
```console
kubectl create -f examples/spark
```

After it's set up:

```console
kubectl get pods # Make sure everything is running
kubectl get svc -o wide # Get the Loadbalancer endpoints for spark-ui-proxy and zeppelin
```

At which point the Master UI and Zeppelin will be available at the URLs under the `EXTERNAL-IP` field.

You can also interact with the Spark cluster using the traditional `spark-shell` /
`spark-submit` / `pyspark` commands by using `kubectl exec` against the
`zeppelin-controller` pod.

If your Kubernetes cluster does not have a Loadbalancer integration, use `kubectl proxy` and `kubectl port-forward` to access the Spark UI and Zeppelin.

For Spark UI:

```console
kubectl proxy --port=8001
```
Then visit [http://localhost:8001/api/v1/proxy/namespaces/spark-cluster/services/spark-ui-proxy/](http://localhost:8001/api/v1/proxy/namespaces/spark-cluster/services/spark-ui-proxy/).

For Zeppelin:

```console
kubectl port-forward zeppelin-controller-abc123 8080:8080 &
```
Then visit [http://localhost:8080/](http://localhost:8080/).

## Known Issues With Spark
* This provides a Spark configuration that is restricted to the cluster network,
  meaning the Spark master is only available as a cluster service. If you need
  to submit jobs using an external client other than Zeppelin or `spark-submit` on
  the `zeppelin` pod, you will need to provide a way for your clients to get to
  the service defined in
  [`examples/spark/spark-master-service.yaml`](spark-master-service.yaml). See
  [Services](../../docs/user-guide/services.md) for more information.

## Known Issues With Zeppelin
* The Zeppelin pod is large, so it may take a while to pull depending on your
  network. The size of the Zeppelin pod is something we're working on; see issue #17231.
* Zeppelin may take some time (about a minute) on this pipeline the first time
  you run it. It seems to take considerable time to load.
* On GKE, `kubectl port-forward` may not be stable over long periods of time. If
  you see Zeppelin go into `Disconnected` state (there will be a red dot on the
  top right as well), the `port-forward` probably failed and needs to be
  restarted. See #12179.
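
The `port-forward` failure described in the last item can sometimes be spotted without opening the browser, by probing the forwarded local port. The Python 3 sketch below is only a rough check under the assumption that a dead `port-forward` no longer listens on `localhost:8080`. It cannot detect a forward that still listens locally but has lost its connection to the pod. The function name `port_is_open` is made up for illustration.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if not port_is_open("localhost", 8080):
    print("Nothing is listening on localhost:8080; restart kubectl port-forward.")
```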