# Spark example

Following this example, you will create a functional [Apache
Spark](http://spark.apache.org/) cluster using Kubernetes and
[Docker](http://docker.io).

You will set up a Spark master service and a set of
Spark workers using Spark's [standalone mode](http://spark.apache.org/docs/latest/spark-standalone.html).

For the impatient expert, jump straight to the [tl;dr](#tldr)
section.

### Sources

Source is freely available at:

* Docker image - https://github.com/mattf/docker-spark
* Docker Trusted Build - https://registry.hub.docker.com/search?q=mattf/spark

## Step Zero: Prerequisites

This example assumes you have a Kubernetes cluster installed and
running, and that you have installed the `kubectl` command line
tool somewhere in your path. Please see the [getting started
guides](../../docs/getting-started-guides) for installation
instructions for your platform.
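
A quick sanity check that both the client and the cluster are
reachable (this should print both a client and a server version):

```shell
$ kubectl version
```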

## Step One: Start your Master service

The Master service is the master (or head) of a Spark cluster:
workers register with it, and applications are submitted through it.

Use the `examples/spark/spark-master.json` file to create a pod running
the Master service.

```shell
$ kubectl create -f examples/spark/spark-master.json
```
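
For orientation, here is a minimal sketch of what a pod definition of
this era roughly looks like. The exact fields below are illustrative,
not copied from the repository; the `examples/spark/spark-master.json`
file is authoritative:

```json
{
  "id": "spark-master",
  "kind": "Pod",
  "apiVersion": "v1beta1",
  "desiredState": {
    "manifest": {
      "version": "v1beta1",
      "id": "spark-master",
      "containers": [
        {
          "name": "spark-master",
          "image": "mattf/spark-master",
          "ports": [{ "containerPort": 7077 }]
        }
      ]
    }
  },
  "labels": { "name": "spark-master" }
}
```

The `name=spark-master` label is what the service created in the next
step selects on.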

Then, use the `examples/spark/spark-master-service.json` file to
create a logical service endpoint that Spark workers can use to access
the Master pod.

```shell
$ kubectl create -f examples/spark/spark-master-service.json
```
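
Likewise, a hedged sketch of the service definition; v1beta1-style
services declare the port and a label selector at the top level
(again illustrative only, the real file is authoritative):

```json
{
  "id": "spark-master",
  "kind": "Service",
  "apiVersion": "v1beta1",
  "port": 7077,
  "containerPort": 7077,
  "selector": { "name": "spark-master" }
}
```

The selector ties the service's stable IP to whichever pod carries the
`name=spark-master` label, which is what lets workers find the master
at a fixed address.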

Ensure that the Master service is running and functional.

### Check to see if Master is running and accessible

```shell
$ kubectl get pods,services
POD            IP              CONTAINER(S)   IMAGE(S)             HOST                        LABELS              STATUS
spark-master   192.168.90.14   spark-master   mattf/spark-master   172.18.145.8/172.18.145.8   name=spark-master   Running
NAME            LABELS                                    SELECTOR            IP               PORT
kubernetes      component=apiserver,provider=kubernetes   <none>              10.254.0.2       443
kubernetes-ro   component=apiserver,provider=kubernetes   <none>              10.254.0.1       80
spark-master    name=spark-master                         name=spark-master   10.254.125.166   7077
```
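
If the pod is stuck in a non-Running state or keeps restarting, the
master's log is the first place to look (very old clients spell this
`kubectl log`):

```shell
$ kubectl logs spark-master
```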

Connect to http://192.168.90.14:8080 (the master pod's IP in the
listing above) to see the status of the master.

```shell
$ links -dump 192.168.90.14:8080
[IMG] 1.2.1 Spark Master at spark://spark-master:7077

 * URL: spark://spark-master:7077
 * Workers: 0
 * Cores: 0 Total, 0 Used
 * Memory: 0.0 B Total, 0.0 B Used
 * Applications: 0 Running, 0 Completed
 * Drivers: 0 Running, 0 Completed
 * Status: ALIVE
...
```

(Pull requests welcome for an alternative that uses the service IP and
port)

## Step Two: Start your Spark workers
## Step Two: Start your Spark workers
|
|
|
|
|
|
|
|
The Spark workers do the heavy lifting in a Spark cluster. They
|
|
|
|
provide execution resources and data cache capabilities for your
|
|
|
|
program.

The Spark workers need the Master service to be running.

Use the `examples/spark/spark-worker-controller.json` file to create a
ReplicationController that manages the worker pods.

```shell
$ kubectl create -f examples/spark/spark-worker-controller.json
```
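
As with the earlier files, a rough, illustrative sketch of the
controller definition; the replica count and labels below are
assumptions inferred from the output later in this example, so check
the real file:

```json
{
  "id": "spark-worker-controller",
  "kind": "ReplicationController",
  "apiVersion": "v1beta1",
  "desiredState": {
    "replicas": 3,
    "replicaSelector": { "name": "spark-worker" },
    "podTemplate": {
      "desiredState": {
        "manifest": {
          "version": "v1beta1",
          "id": "spark-worker",
          "containers": [
            { "name": "spark-worker", "image": "mattf/spark-worker" }
          ]
        }
      },
      "labels": { "name": "spark-worker", "uses": "spark-master" }
    }
  },
  "labels": { "name": "spark-worker" }
}
```

The controller's job is simply to keep `replicas` worker pods matching
the selector alive, restarting them if they fail.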

### Check to see if the workers are running

```shell
$ links -dump 192.168.90.14:8080
[IMG] 1.2.1 Spark Master at spark://spark-master:7077

 * URL: spark://spark-master:7077
 * Workers: 3
 * Cores: 12 Total, 0 Used
 * Memory: 20.4 GB Total, 0.0 B Used
 * Applications: 0 Running, 0 Completed
 * Drivers: 0 Running, 0 Completed
 * Status: ALIVE

Workers

Id                                          Address               State   Cores        Memory
worker-20150318151745-192.168.75.14-46422   192.168.75.14:46422   ALIVE   4 (0 Used)   6.8 GB (0.0 B Used)
worker-20150318151746-192.168.35.17-53654   192.168.35.17:53654   ALIVE   4 (0 Used)   6.8 GB (0.0 B Used)
worker-20150318151746-192.168.90.17-50490   192.168.90.17:50490   ALIVE   4 (0 Used)   6.8 GB (0.0 B Used)
...
```

(Pull requests welcome for an alternative that uses the service IP and
port)
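
Because the workers are managed by a ReplicationController rather than
created one by one, resizing the cluster is a single command. A
sketch; older clients spell this `kubectl resize`:

```shell
# ask the controller to maintain 6 worker pods instead of 3
$ kubectl scale --replicas=6 replicationcontrollers spark-worker-controller
```

New workers register with the master on their own, so the UI above
should reflect them shortly.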

## Step Three: Do something with the cluster

```shell
$ kubectl get pods,services
POD                             IP              CONTAINER(S)   IMAGE(S)             HOST                          LABELS                                STATUS
spark-master                    192.168.90.14   spark-master   mattf/spark-master   172.18.145.8/172.18.145.8     name=spark-master                     Running
spark-worker-controller-51wgg   192.168.75.14   spark-worker   mattf/spark-worker   172.18.145.9/172.18.145.9     name=spark-worker,uses=spark-master   Running
spark-worker-controller-5v48c   192.168.90.17   spark-worker   mattf/spark-worker   172.18.145.8/172.18.145.8     name=spark-worker,uses=spark-master   Running
spark-worker-controller-ehq23   192.168.35.17   spark-worker   mattf/spark-worker   172.18.145.12/172.18.145.12   name=spark-worker,uses=spark-master   Running
NAME            LABELS                                    SELECTOR            IP               PORT
kubernetes      component=apiserver,provider=kubernetes   <none>              10.254.0.2       443
kubernetes-ro   component=apiserver,provider=kubernetes   <none>              10.254.0.1       80
spark-master    name=spark-master                         name=spark-master   10.254.125.166   7077

# start an interactive shell in a container that has the Spark client tools
$ sudo docker run -it mattf/spark-base sh

# make the name spark-master resolve to the service IP listed above
sh-4.2# echo "10.254.125.166 spark-master" >> /etc/hosts

# have Spark advertise this container by IP, since its hostname is not
# resolvable from the workers
sh-4.2# export SPARK_LOCAL_HOSTNAME=$(hostname -i)

# point PySpark at the master and start an interactive session
sh-4.2# MASTER=spark://spark-master:7077 pyspark
Python 2.7.5 (default, Jun 17 2014, 18:11:42)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.2.1
      /_/

Using Python version 2.7.5 (default, Jun 17 2014 18:11:42)
SparkContext available as sc.
>>> import socket, resource
>>> sc.parallelize(range(1000)).map(lambda x: (socket.gethostname(), resource.getrlimit(resource.RLIMIT_NOFILE))).distinct().collect()
[('spark-worker-controller-ehq23', (1048576, 1048576)), ('spark-worker-controller-5v48c', (1048576, 1048576)), ('spark-worker-controller-51wgg', (1048576, 1048576))]
```
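
To push some real computation through the workers, keep typing at the
same `>>>` prompt. Here is a hypothetical Monte Carlo estimate of pi;
`sc` is the SparkContext the shell already created, and everything
else is defined inline:

```python
>>> import random
>>> def in_unit_circle(_):
...     # sample a random point in the unit square and test membership
...     x, y = random.random(), random.random()
...     return x * x + y * y < 1.0
...
>>> n = 100000
>>> 4.0 * sc.parallelize(range(n)).filter(in_unit_circle).count() / n
```

The result should hover around 3.14, with the filtering work spread
across the three workers.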

## tl;dr

`kubectl create -f spark-master.json`

`kubectl create -f spark-master-service.json`

Make sure the Master Pod is running (use: `kubectl get pods`).

`kubectl create -f spark-worker-controller.json`
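
When you are finished, teardown is symmetric to setup, assuming the
object names used throughout this example (depending on your client
version, you may need to scale the controller to zero replicas first
so worker pods are not left behind):

```shell
$ kubectl delete replicationcontrollers spark-worker-controller
$ kubectl delete services spark-master
$ kubectl delete pods spark-master
```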

[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/examples/spark/README.md?pixel)]()