Merge pull request #10407 from mwielgus/spark-example

Updated spark example with docker images moved to gcr.io/google_containers
pull/6/head
Piotr Szczesniak 2015-07-08 11:23:06 +02:00
commit b24b251cb1
12 changed files with 226 additions and 78 deletions

View File

@ -12,9 +12,7 @@ section.
### Sources
Source is freely available at:
* Docker image - https://github.com/mattf/docker-spark
* Docker Trusted Build - https://registry.hub.docker.com/search?q=mattf/spark
The Docker images are heavily based on https://github.com/mattf/docker-spark
## Step Zero: Prerequisites
@ -36,7 +34,7 @@ the Master service.
$ kubectl create -f examples/spark/spark-master.json
```
Then, use the `examples/spark/spark-master-service.json` file to
Then, use the [`examples/spark/spark-master-service.json`](spar-master-service.json) file to
create a logical service endpoint that Spark workers can use to access
the Master pod.
@ -44,38 +42,42 @@ the Master pod.
$ kubectl create -f examples/spark/spark-master-service.json
```
Ensure that the Master service is running and functional.
### Check to see if Master is running and accessible
```shell
$ kubectl get pods,services
POD IP CONTAINER(S) IMAGE(S) HOST LABELS STATUS
spark-master 192.168.90.14 spark-master mattf/spark-master 172.18.145.8/172.18.145.8 name=spark-master Running
NAME LABELS SELECTOR IP PORT
kubernetes component=apiserver,provider=kubernetes <none> 10.254.0.2 443
spark-master name=spark-master name=spark-master 10.254.125.166 7077
$ kubectl get pods
NAME READY REASON RESTARTS AGE
[...]
spark-master 1/1 Running 0 25s
```
Connect to http://192.168.90.14:8080 to see the status of the master.
Check logs to see the status of the master.
```shell
$ links -dump 192.168.90.14:8080
[IMG] 1.2.1 Spark Master at spark://spark-master:7077
$ kubectl logs spark-master
* URL: spark://spark-master:7077
* Workers: 0
* Cores: 0 Total, 0 Used
* Memory: 0.0 B Total, 0.0 B Used
* Applications: 0 Running, 0 Completed
* Drivers: 0 Running, 0 Completed
* Status: ALIVE
...
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-1.4.0-bin-hadoop2.6/sbin/../logs/spark--org.apache.spark.deploy.master.Master-1-spark-master.out
Spark Command: /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java -cp /opt/spark-1.4.0-bin-hadoop2.6/sbin/../conf/:/opt/spark-1.4.0-bin-hadoop2.6/lib/spark-assembly-1.4.0-hadoop2.6.0.jar:/opt/spark-1.4.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.4.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.4.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar -Xms512m -Xmx512m -XX:MaxPermSize=128m org.apache.spark.deploy.master.Master --ip spark-master --port 7077 --webui-port 8080
========================================
15/06/26 14:01:49 INFO Master: Registered signal handlers for [TERM, HUP, INT]
15/06/26 14:01:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/06/26 14:01:51 INFO SecurityManager: Changing view acls to: root
15/06/26 14:01:51 INFO SecurityManager: Changing modify acls to: root
15/06/26 14:01:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/06/26 14:01:51 INFO Slf4jLogger: Slf4jLogger started
15/06/26 14:01:51 INFO Remoting: Starting remoting
15/06/26 14:01:52 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@spark-master:7077]
15/06/26 14:01:52 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
15/06/26 14:01:52 INFO Utils: Successfully started service on port 6066.
15/06/26 14:01:52 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
15/06/26 14:01:52 INFO Master: Starting Spark master at spark://spark-master:7077
15/06/26 14:01:52 INFO Master: Running Spark version 1.4.0
15/06/26 14:01:52 INFO Utils: Successfully started service 'MasterUI' on port 8080.
15/06/26 14:01:52 INFO MasterWebUI: Started MasterWebUI at http://10.244.2.34:8080
15/06/26 14:01:53 INFO Master: I have been elected leader! New state: ALIVE
```
(Pull requests welcome for an alternative that uses the service IP and
port)
## Step Two: Start your Spark workers
The Spark workers do the heavy lifting in a Spark cluster. They
@ -94,71 +96,80 @@ $ kubectl create -f examples/spark/spark-worker-controller.json
### Check to see if the workers are running
```shell
$ links -dump 192.168.90.14:8080
[IMG] 1.2.1 Spark Master at spark://spark-master:7077
$ kubectl get pods
NAME READY REASON RESTARTS AGE
[...]
spark-master 1/1 Running 0 14m
spark-worker-controller-hifwi 1/1 Running 0 33s
spark-worker-controller-u40r2 1/1 Running 0 33s
spark-worker-controller-vpgyg 1/1 Running 0 33s
* URL: spark://spark-master:7077
* Workers: 3
* Cores: 12 Total, 0 Used
* Memory: 20.4 GB Total, 0.0 B Used
* Applications: 0 Running, 0 Completed
* Drivers: 0 Running, 0 Completed
* Status: ALIVE
Workers
Id Address State Cores Memory
4 (0 6.8 GB
worker-20150318151745-192.168.75.14-46422 192.168.75.14:46422 ALIVE Used) (0.0 B
Used)
4 (0 6.8 GB
worker-20150318151746-192.168.35.17-53654 192.168.35.17:53654 ALIVE Used) (0.0 B
Used)
4 (0 6.8 GB
worker-20150318151746-192.168.90.17-50490 192.168.90.17:50490 ALIVE Used) (0.0 B
Used)
...
$ kubectl logs spark-master
[...]
15/06/26 14:15:43 INFO Master: Registering worker 10.244.2.35:46199 with 1 cores, 2.6 GB RAM
15/06/26 14:15:55 INFO Master: Registering worker 10.244.1.15:44839 with 1 cores, 2.6 GB RAM
15/06/26 14:15:55 INFO Master: Registering worker 10.244.0.19:60970 with 1 cores, 2.6 GB RAM
```
(Pull requests welcome for an alternative that uses the service IP and
port)
## Step Three: Do something with the cluster
Get the address and port of the Master service.
```shell
$ kubectl get pods,services
POD IP CONTAINER(S) IMAGE(S) HOST LABELS STATUS
spark-master 192.168.90.14 spark-master mattf/spark-master 172.18.145.8/172.18.145.8 name=spark-master Running
spark-worker-controller-51wgg 192.168.75.14 spark-worker mattf/spark-worker 172.18.145.9/172.18.145.9 name=spark-worker,uses=spark-master Running
spark-worker-controller-5v48c 192.168.90.17 spark-worker mattf/spark-worker 172.18.145.8/172.18.145.8 name=spark-worker,uses=spark-master Running
spark-worker-controller-ehq23 192.168.35.17 spark-worker mattf/spark-worker 172.18.145.12/172.18.145.12 name=spark-worker,uses=spark-master Running
NAME LABELS SELECTOR IP PORT
kubernetes component=apiserver,provider=kubernetes <none> 10.254.0.2 443
spark-master name=spark-master name=spark-master 10.254.125.166 7077
$ kubectl get service spark-master
NAME LABELS SELECTOR IP(S) PORT(S)
spark-master name=spark-master name=spark-master 10.0.204.187 7077/TCP
```
$ sudo docker run -it mattf/spark-base sh
SSH to one of your cluster nodes. On GCE/GKE you can either use [Developers Console](https://console.developers.google.com)
(more details [here](https://cloud.google.com/compute/docs/ssh-in-browser))
or run `gcloud compute ssh <name>` where the name can be taken from `kubectl get nodes`
(more details [here](https://cloud.google.com/compute/docs/gcloud-compute/#connecting)).
sh-4.2# echo "10.254.125.166 spark-master" >> /etc/hosts
```
$ kubectl get nodes
NAME LABELS STATUS
kubernetes-minion-5jvu kubernetes.io/hostname=kubernetes-minion-5jvu Ready
kubernetes-minion-6fbi kubernetes.io/hostname=kubernetes-minion-6fbi Ready
kubernetes-minion-8y2v kubernetes.io/hostname=kubernetes-minion-8y2v Ready
kubernetes-minion-h0tr kubernetes.io/hostname=kubernetes-minion-h0tr Ready
sh-4.2# export SPARK_LOCAL_HOSTNAME=$(hostname -i)
$ gcloud compute ssh kubernetes-minion-5jvu --zone=us-central1-b
Linux kubernetes-minion-5jvu 3.16.0-0.bpo.4-amd64 #1 SMP Debian 3.16.7-ckt9-3~deb8u1~bpo70+1 (2015-04-27) x86_64
sh-4.2# MASTER=spark://spark-master:7077 pyspark
Python 2.7.5 (default, Jun 17 2014, 18:11:42)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] on linux2
=== GCE Kubernetes node setup complete ===
me@kubernetes-minion-5jvu:~$
```
Once logged in run spark-base image. Inside of the image there is a script
that sets up the environment based on the provided IP and port of the Master.
```
cluster-node $ sudo docker run -it gcr.io/google_containers/spark-base
root@f12a6fec45ce:/# . /setup_client.sh 10.0.204.187 7077
root@f12a6fec45ce:/# pyspark
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
15/06/26 14:25:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 1.2.1
/__ / .__/\_,_/_/ /_/\_\ version 1.4.0
/_/
Using Python version 2.7.5 (default, Jun 17 2014 18:11:42)
SparkContext available as sc.
>>> import socket, resource
>>> sc.parallelize(range(1000)).map(lambda x: (socket.gethostname(), resource.getrlimit(resource.RLIMIT_NOFILE))).distinct().collect()
[('spark-worker-controller-ehq23', (1048576, 1048576)), ('spark-worker-controller-5v48c', (1048576, 1048576)), ('spark-worker-controller-51wgg', (1048576, 1048576))]
Using Python version 2.7.9 (default, Mar 1 2015 12:57:24)
SparkContext available as sc, HiveContext available as sqlContext.
>>> import socket
>>> sc.parallelize(range(1000)).map(lambda x:socket.gethostname()).distinct().collect()
['spark-worker-controller-u40r2', 'spark-worker-controller-hifwi', 'spark-worker-controller-vpgyg']
```
## Result
You now have services, replication controllers, and pods for the Spark master and Spark workers.
You can take this example to the next step and start using the Apache Spark cluster
you just created, see [Spark documentation](https://spark.apache.org/documentation.html)
for more information.
## tl;dr
@ -170,5 +181,4 @@ Make sure the Master Pod is running (use: ```kubectl get pods```).
```kubectl create -f spark-worker-controller.json```
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/examples/spark/README.md?pixel)]()

View File

@ -0,0 +1,17 @@
FROM java:latest
RUN apt-get update -y
RUN apt-get install -y scala
# Get Spark from some apache mirror.
RUN mkdir -p /opt && \
cd /opt && \
wget http://apache.mirrors.pair.com/spark/spark-1.4.0/spark-1.4.0-bin-hadoop2.6.tgz && \
tar -zvxf spark-1.4.0-bin-hadoop2.6.tgz && \
rm spark-1.4.0-bin-hadoop2.6.tgz && \
ln -s spark-1.4.0-bin-hadoop2.6 spark && \
echo Spark installed in /opt
ADD log4j.properties /opt/spark/conf/log4j.properties
ADD setup_client.sh /
ENV PATH $PATH:/opt/spark/bin

View File

@ -0,0 +1,12 @@
# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

View File

@ -0,0 +1,24 @@
#!/bin/bash
# Copyright 2015 The Kubernetes Authors All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
if [[ $# != 2 || $1 == "" || $2 == "" ]]; then
echo "Usage: . ./setup_client.sh master_address master_port"
exit 1
fi
echo "$1 spark-master" >> /etc/hosts
export SPARK_LOCAL_HOSTNAME=$(hostname -i)
export MASTER=spark://spark-master:$2

View File

@ -0,0 +1,7 @@
FROM gcr.io/google_containers/spark-base
ADD start.sh /
ADD log4j.properties /opt/spark/conf/log4j.properties
EXPOSE 7077
ENTRYPOINT ["/start.sh"]

View File

@ -0,0 +1,12 @@
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

View File

@ -0,0 +1,19 @@
#!/bin/bash
# Copyright 2015 The Kubernetes Authors All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
export SPARK_MASTER_PORT=${SPARK_MASTER_SERVICE_PORT:-7077}
/opt/spark/sbin/start-master.sh
tail -F /opt/spark/logs/*

View File

@ -0,0 +1,7 @@
FROM gcr.io/google_containers/spark-base
ADD start.sh /
ADD log4j.properties /opt/spark/conf/log4j.properties
EXPOSE 8080
ENTRYPOINT ["/start.sh"]

View File

@ -0,0 +1,12 @@
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

View File

@ -0,0 +1,28 @@
#!/bin/bash
# Copyright 2015 The Kubernetes Authors All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
if [[ ${SPARK_MASTER_SERVICE_HOST} == "" ]]; then
echo "Spark Master service must be created before starting any workers"
sleep 30 # To postpone pod restart
exit 1
fi
echo "${SPARK_MASTER_SERVICE_HOST} spark-master" >> /etc/hosts
export SPARK_LOCAL_HOSTNAME=$(hostname -i)
/opt/spark/sbin/start-slave.sh spark://spark-master:${SPARK_MASTER_SERVICE_PORT}
tail -F /opt/spark/logs/*

View File

@ -11,7 +11,7 @@
"containers": [
{
"name": "spark-master",
"image": "mattf/spark-master",
"image": "gcr.io/google_containers/spark-master",
"ports": [
{
"containerPort": 7077

View File

@ -23,7 +23,7 @@
"containers": [
{
"name": "spark-worker",
"image": "mattf/spark-worker",
"image": "gcr.io/google_containers/spark-worker",
"ports": [
{
"hostPort": 8888,