Spark example
Following this example, you will create a functional Apache Spark cluster using Kubernetes and Docker.
You will setup a Spark master service and a set of Spark workers using Spark's standalone mode.
For the impatient expert, jump straight to the tl;dr section.
Sources
Source is freely available at:
- Docker image - https://github.com/mattf/docker-spark
- Docker Trusted Build - https://registry.hub.docker.com/search?q=mattf/spark
Step Zero: Prerequisites
This example assumes you have a Kubernetes cluster installed and running, and that you have installed the kubectl command line tool somewhere in your path. Please see the getting started guides for installation instructions for your platform.
Step One: Start your Master service
The Master service is the master (or head) service for a Spark cluster.
Use the examples/spark/spark-master.json file to create a pod running the Master service.
$ kubectl create -f examples/spark/spark-master.json
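For orientation, the pod definition amounts to a single container built from the mattf/spark-master image, labeled name=spark-master and exposing the standalone master port 7077. A minimal sketch in the v1beta1 style of the time follows; the exact field names are illustrative assumptions, and the checked-in examples/spark/spark-master.json is authoritative.

{
  "id": "spark-master",
  "kind": "Pod",
  "apiVersion": "v1beta1",
  "labels": { "name": "spark-master" },
  "desiredState": {
    "manifest": {
      "version": "v1beta1",
      "id": "spark-master",
      "containers": [{
        "name": "spark-master",
        "image": "mattf/spark-master",
        "ports": [{ "containerPort": 7077 }]
      }]
    }
  }
}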
Then, use the examples/spark/spark-master-service.json file to create a logical service endpoint that Spark workers can use to access the Master pod.
$ kubectl create -f examples/spark/spark-master-service.json
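The service definition simply gives the Master a stable IP and forwards port 7077 to whichever pod carries the name=spark-master label. Again a minimal sketch with illustrative v1beta1-style fields; the checked-in examples/spark/spark-master-service.json is authoritative.

{
  "id": "spark-master",
  "kind": "Service",
  "apiVersion": "v1beta1",
  "port": 7077,
  "containerPort": 7077,
  "selector": { "name": "spark-master" },
  "labels": { "name": "spark-master" }
}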
Ensure that the Master service is running and functional.
Check to see if Master is running and accessible
$ kubectl get pods,services
POD IP CONTAINER(S) IMAGE(S) HOST LABELS STATUS
spark-master 192.168.90.14 spark-master mattf/spark-master 172.18.145.8/172.18.145.8 name=spark-master Running
NAME LABELS SELECTOR IP PORT
kubernetes component=apiserver,provider=kubernetes <none> 10.254.0.2 443
kubernetes-ro component=apiserver,provider=kubernetes <none> 10.254.0.1 80
spark-master name=spark-master name=spark-master 10.254.125.166 7077
Connect to http://192.168.90.14:8080 to see the status of the master.
$ links -dump 192.168.90.14:8080
[IMG] 1.2.1 Spark Master at spark://spark-master:7077
* URL: spark://spark-master:7077
* Workers: 0
* Cores: 0 Total, 0 Used
* Memory: 0.0 B Total, 0.0 B Used
* Applications: 0 Running, 0 Completed
* Drivers: 0 Running, 0 Completed
* Status: ALIVE
...
(Pull requests welcome for an alternative that uses the service IP and port)
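If links is not installed, any HTTP client pointed at the Master pod's IP and UI port shows the same page, for example (assuming curl is available on the host):

$ curl http://192.168.90.14:8080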
Step Two: Start your Spark workers
The Spark workers do the heavy lifting in a Spark cluster. They provide execution resources and data cache capabilities for your program.
The Spark workers need the Master service to be running.
Use the examples/spark/spark-worker-controller.json file to create a ReplicationController that manages the worker pods.
$ kubectl create -f examples/spark/spark-worker-controller.json
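The controller's job is to keep a fixed number of worker pods running (three in the output below), each labeled name=spark-worker,uses=spark-master and built from the mattf/spark-worker image. A rough sketch with illustrative v1beta1-style fields; the checked-in examples/spark/spark-worker-controller.json is authoritative.

{
  "id": "spark-worker-controller",
  "kind": "ReplicationController",
  "apiVersion": "v1beta1",
  "labels": { "name": "spark-worker" },
  "desiredState": {
    "replicas": 3,
    "replicaSelector": { "name": "spark-worker" },
    "podTemplate": {
      "labels": { "name": "spark-worker", "uses": "spark-master" },
      "desiredState": {
        "manifest": {
          "version": "v1beta1",
          "id": "spark-worker",
          "containers": [{
            "name": "spark-worker",
            "image": "mattf/spark-worker"
          }]
        }
      }
    }
  }
}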
Check to see if the workers are running
$ links -dump 192.168.90.14:8080
[IMG] 1.2.1 Spark Master at spark://spark-master:7077
* URL: spark://spark-master:7077
* Workers: 3
* Cores: 12 Total, 0 Used
* Memory: 20.4 GB Total, 0.0 B Used
* Applications: 0 Running, 0 Completed
* Drivers: 0 Running, 0 Completed
* Status: ALIVE
Workers
Id                                          Address               State  Cores       Memory
worker-20150318151745-192.168.75.14-46422   192.168.75.14:46422   ALIVE  4 (0 Used)  6.8 GB (0.0 B Used)
worker-20150318151746-192.168.35.17-53654   192.168.35.17:53654   ALIVE  4 (0 Used)  6.8 GB (0.0 B Used)
worker-20150318151746-192.168.90.17-50490   192.168.90.17:50490   ALIVE  4 (0 Used)  6.8 GB (0.0 B Used)
...
(Pull requests welcome for an alternative that uses the service IP and port)
Step Three: Do something with the cluster
$ kubectl get pods,services
POD IP CONTAINER(S) IMAGE(S) HOST LABELS STATUS
spark-master 192.168.90.14 spark-master mattf/spark-master 172.18.145.8/172.18.145.8 name=spark-master Running
spark-worker-controller-51wgg 192.168.75.14 spark-worker mattf/spark-worker 172.18.145.9/172.18.145.9 name=spark-worker,uses=spark-master Running
spark-worker-controller-5v48c 192.168.90.17 spark-worker mattf/spark-worker 172.18.145.8/172.18.145.8 name=spark-worker,uses=spark-master Running
spark-worker-controller-ehq23 192.168.35.17 spark-worker mattf/spark-worker 172.18.145.12/172.18.145.12 name=spark-worker,uses=spark-master Running
NAME LABELS SELECTOR IP PORT
kubernetes component=apiserver,provider=kubernetes <none> 10.254.0.2 443
kubernetes-ro component=apiserver,provider=kubernetes <none> 10.254.0.1 80
spark-master name=spark-master name=spark-master 10.254.125.166 7077
$ sudo docker run -it mattf/spark-base sh
sh-4.2# echo "10.254.125.166 spark-master" >> /etc/hosts   # make the Master's service IP resolvable by the name it advertises
sh-4.2# export SPARK_LOCAL_HOSTNAME=$(hostname -i)         # advertise the driver by IP, since the container's hostname is not resolvable from the workers
sh-4.2# MASTER=spark://spark-master:7077 pyspark           # point pyspark at the standalone Master
Python 2.7.5 (default, Jun 17 2014, 18:11:42)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 1.2.1
/_/
Using Python version 2.7.5 (default, Jun 17 2014 18:11:42)
SparkContext available as sc.
>>> import socket, resource
>>> sc.parallelize(range(1000)).map(lambda x: (socket.gethostname(), resource.getrlimit(resource.RLIMIT_NOFILE))).distinct().collect()
[('spark-worker-controller-ehq23', (1048576, 1048576)), ('spark-worker-controller-5v48c', (1048576, 1048576)), ('spark-worker-controller-51wgg', (1048576, 1048576))]
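Still in the same pyspark session, any job that parallelizes well is spread across the three workers. For instance, a quick Monte Carlo estimate of pi (a sketch; the printed value will vary, but should land near 3.14):

>>> import random
>>> def inside(_):
...     x, y = random.random(), random.random()
...     return x * x + y * y < 1.0
...
>>> n = 100000
>>> 4.0 * sc.parallelize(range(n)).filter(inside).count() / n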
tl;dr
kubectl create -f spark-master.json
kubectl create -f spark-master-service.json
Make sure the Master Pod is running (use: kubectl get pods).
kubectl create -f spark-worker-controller.json
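Make sure the worker pods are running and have registered with the Master (use: kubectl get pods -l name=spark-worker, and links -dump against the Master UI as above to confirm the worker count).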