<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
<!-- BEGIN STRIP_FOR_RELEASE -->
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<strong>
The latest 1.0.x release of this document can be found
[here](http://releases.k8s.io/release-1.0/examples/spark/README.md).

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>

--
<!-- END STRIP_FOR_RELEASE -->
<!-- END MUNGE: UNVERSIONED_WARNING -->
# Spark example

Following this example, you will create a functional [Apache
Spark](http://spark.apache.org/) cluster using Kubernetes and
[Docker](http://docker.io).

You will set up a Spark master service and a set of Spark workers using Spark's
[standalone mode](http://spark.apache.org/docs/latest/spark-standalone.html).

For the impatient expert, jump straight to the [tl;dr](#tldr) section.
### Sources
The Docker images are heavily based on https://github.com/mattf/docker-spark
## Step Zero: Prerequisites

This example assumes you have a Kubernetes cluster installed and
running, and that you have installed the `kubectl` command line
tool somewhere in your path. Please see the
[getting started guides](../../docs/getting-started-guides/) for installation
instructions for your platform.
## Step One: Start your Master service

The Master [service](../../docs/user-guide/services.md) is the master (or head) service for a Spark
cluster.

Use the [`examples/spark/spark-master.json`](spark-master.json) file to create a
[pod](../../docs/user-guide/pods.md) running the Master service.
```sh
$ kubectl create -f examples/spark/spark-master.json
```
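
For reference, a pod definition of this shape looks roughly like the sketch
below. This is illustrative only, not the shipped file: the image name and
labels are assumptions made here for readability (JSON allows no comments, so
all caveats live in this paragraph), and [`spark-master.json`](spark-master.json)
in this directory is the authoritative version. Port 7077 is Spark's standard
master port, which you will also see in the master logs below.

```json
{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "spark-master",
    "labels": {
      "name": "spark-master"
    }
  },
  "spec": {
    "containers": [
      {
        "name": "spark-master",
        "image": "gcr.io/google_containers/spark-master",
        "ports": [
          {
            "containerPort": 7077
          }
        ]
      }
    ]
  }
}
```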

Then, use the [`examples/spark/spark-master-service.json`](spark-master-service.json) file to
create a logical service endpoint that Spark workers can use to access
the Master pod.
```sh
$ kubectl create -f examples/spark/spark-master-service.json
```
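
A service of this shape might look like the following sketch (again
illustrative; [`spark-master-service.json`](spark-master-service.json) is the
authoritative version). The service's name matters: workers reach the master
as `spark-master`, so the name must match, and the selector must match the
labels on the Master pod.

```json
{
  "kind": "Service",
  "apiVersion": "v1",
  "metadata": {
    "name": "spark-master"
  },
  "spec": {
    "ports": [
      {
        "port": 7077,
        "targetPort": 7077
      }
    ],
    "selector": {
      "name": "spark-master"
    }
  }
}
```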
### Check to see if Master is running and accessible

```sh
$ kubectl get pods
NAME           READY     STATUS    RESTARTS   AGE
[...]
spark-master   1/1       Running   0          25s
```

Check logs to see the status of the master.

```sh
$ kubectl logs spark-master
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-1.4.0-bin-hadoop2.6/sbin/../logs/spark--org.apache.spark.deploy.master.Master-1-spark-master.out
Spark Command: /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java -cp /opt/spark-1.4.0-bin-hadoop2.6/sbin/../conf/:/opt/spark-1.4.0-bin-hadoop2.6/lib/spark-assembly-1.4.0-hadoop2.6.0.jar:/opt/spark-1.4.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.4.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.4.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar -Xms512m -Xmx512m -XX:MaxPermSize=128m org.apache.spark.deploy.master.Master --ip spark-master --port 7077 --webui-port 8080
========================================
15/06/26 14:01:49 INFO Master: Registered signal handlers for [TERM, HUP, INT]
15/06/26 14:01:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/06/26 14:01:51 INFO SecurityManager: Changing view acls to: root
15/06/26 14:01:51 INFO SecurityManager: Changing modify acls to: root
15/06/26 14:01:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/06/26 14:01:51 INFO Slf4jLogger: Slf4jLogger started
15/06/26 14:01:51 INFO Remoting: Starting remoting
15/06/26 14:01:52 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@spark-master:7077]
15/06/26 14:01:52 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
15/06/26 14:01:52 INFO Utils: Successfully started service on port 6066.
15/06/26 14:01:52 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
15/06/26 14:01:52 INFO Master: Starting Spark master at spark://spark-master:7077
15/06/26 14:01:52 INFO Master: Running Spark version 1.4.0
15/06/26 14:01:52 INFO Utils: Successfully started service 'MasterUI' on port 8080.
15/06/26 14:01:52 INFO MasterWebUI: Started MasterWebUI at http://10.244.2.34:8080
15/06/26 14:01:53 INFO Master: I have been elected leader! New state: ALIVE
```
## Step Two: Start your Spark workers
The Spark workers do the heavy lifting in a Spark cluster. They
provide execution resources and data cache capabilities for your
program.
The Spark workers need the Master service to be running.

Use the [`examples/spark/spark-worker-controller.json`](spark-worker-controller.json) file to create a
[replication controller](../../docs/user-guide/replication-controller.md) that manages the worker pods.

```sh
$ kubectl create -f examples/spark/spark-worker-controller.json
```
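
A replication controller of this shape might look roughly like the sketch
below; the image name, labels, and worker web UI port (8081 is Spark's
standalone default) are illustrative assumptions, and
[`spark-worker-controller.json`](spark-worker-controller.json) is the
authoritative version. `"replicas": 3` matches the three worker pods you will
see in the output below.

```json
{
  "kind": "ReplicationController",
  "apiVersion": "v1",
  "metadata": {
    "name": "spark-worker-controller"
  },
  "spec": {
    "replicas": 3,
    "selector": {
      "name": "spark-worker"
    },
    "template": {
      "metadata": {
        "labels": {
          "name": "spark-worker"
        }
      },
      "spec": {
        "containers": [
          {
            "name": "spark-worker",
            "image": "gcr.io/google_containers/spark-worker",
            "ports": [
              {
                "containerPort": 8081
              }
            ]
          }
        ]
      }
    }
  }
}
```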
### Check to see if the workers are running

```sh
$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
[...]
spark-master                    1/1       Running   0          14m
spark-worker-controller-hifwi   1/1       Running   0          33s
spark-worker-controller-u40r2   1/1       Running   0          33s
spark-worker-controller-vpgyg   1/1       Running   0          33s
$ kubectl logs spark-master
[...]
15/06/26 14:15:43 INFO Master: Registering worker 10.244.2.35:46199 with 1 cores, 2.6 GB RAM
15/06/26 14:15:55 INFO Master: Registering worker 10.244.1.15:44839 with 1 cores, 2.6 GB RAM
15/06/26 14:15:55 INFO Master: Registering worker 10.244.0.19:60970 with 1 cores, 2.6 GB RAM
```
## Step Three: Start your Spark driver to launch jobs on your Spark cluster

The Spark driver is used to launch jobs into the Spark cluster. You can read more about it in the
[Spark architecture](http://spark.apache.org/docs/latest/cluster-overview.html) overview.

```sh
$ kubectl create -f examples/spark/spark-driver.json
```
The Spark driver needs the Master service to be running.
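
The driver pod itself is simply a pod that stays alive so you can `kubectl exec`
into it and run Spark clients against the master service. A minimal sketch is
below; the image name and the keep-alive command are assumptions made here for
illustration, and [`spark-driver.json`](spark-driver.json) is the authoritative
version.

```json
{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "spark-driver",
    "labels": {
      "name": "spark-driver"
    }
  },
  "spec": {
    "containers": [
      {
        "name": "spark-driver",
        "image": "gcr.io/google_containers/spark-driver",
        "command": ["sh", "-c", "while true; do sleep 100; done"]
      }
    ]
  }
}
```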
### Check to see if the driver is running

```sh
$ kubectl get pods
NAME           READY     STATUS    RESTARTS   AGE
[...]
spark-master   1/1       Running   0          14m
spark-driver   1/1       Running   0          10m
```
## Step Four: Do something with the cluster

Use `kubectl exec` to connect to the Spark driver:

```sh
$ kubectl exec spark-driver -it bash
root@spark-driver:/#
root@spark-driver:/# pyspark
Python 2.7.9 (default, Mar  1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
15/06/26 14:25:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/

Using Python version 2.7.9 (default, Mar  1 2015 12:57:24)
SparkContext available as sc, HiveContext available as sqlContext.
>>> import socket
>>> sc.parallelize(range(1000)).map(lambda x:socket.gethostname()).distinct().collect()
['spark-worker-controller-u40r2', 'spark-worker-controller-hifwi', 'spark-worker-controller-vpgyg']
```
## Result

You now have services, replication controllers, and pods for the Spark master, the Spark driver, and the Spark workers.

You can take this example to the next step and start using the Apache Spark cluster
you just created; see the [Spark documentation](https://spark.apache.org/documentation.html)
for more information.

## tl;dr

```sh
kubectl create -f examples/spark/spark-master.json
kubectl create -f examples/spark/spark-master-service.json
```

Make sure the Master pod is running (use `kubectl get pods`), then:

```sh
kubectl create -f examples/spark/spark-worker-controller.json
kubectl create -f examples/spark/spark-driver.json
```
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/examples/spark/README.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->