环境配置

Introduction

Tutorials & Docs

Quick Start

Deploy

Spark 的部署方式可以独立部署,也可以基于 Yarn 或者 Mesos 这样的资源调度框架进行部署。

[Spark 下载地址][1]

StandAlone

将 Spark 的程序文件解压之后可以通过如下命令直接启动一个独立的主机:

./sbin/start-master.sh

执行该命令之后 Spark 会自动执行 jetty 命令启动服务器,同时在命令行或者 log 日志文件中打印出系统地址,譬如:

15/05/28 13:20:57 INFO Master: Starting Spark master at spark://localhost.localdomain:7077
15/05/28 13:21:07 INFO MasterWebUI: Started MasterWebUI at http://192.168.199.166:8080

给出的这个 spark://HOST:PORT 地址可以供 Spark work 节点连接或者作为 master 的参数传入到 SparkContext 中。

需要启动 work 进程并在 master 中完成注册,可以使用如下命令:

./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT

注意,这边的 IP 地址实际上指的是启动 master 时候他监听的域名而来的 IP 地址。

![enter description here][2]

执行上述命令时可以传入的参数为:

Argument

Meaning

-h HOST, --host HOST

Hostname to listen on

-i HOST, --ip HOST

Hostname to listen on (deprecated, use -h or --host)

-p PORT, --port PORT

Port for service to listen on (default: 7077 for master, random for worker)

--webui-port PORT

Port for web UI (default: 8080 for master, 8081 for worker)

-c CORES, --cores CORES

Total CPU cores to allow Spark applications to use on the machine (default: all available); only on worker

-m MEM, --memory MEM

Total amount of memory to allow Spark applications to use on the machine, in a format like 1000M or 2G (default: your machine's total RAM minus 1 GB); only on worker

-d DIR, --work-dir DIR

Directory to use for scratch space and job output logs (default: SPARK_HOME/work); only on worker

--properties-file FILE

Path to a custom Spark properties file to load (default: conf/spark-defaults.conf)

Cluster Launch

  • sbin/start-master.sh - Starts a master instance on the machine the script is executed on.

  • sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.

  • sbin/start-all.sh - Starts both a master and a number of slaves as described above.

  • sbin/stop-master.sh - Stops the master that was started via the bin/start-master.sh script.

  • sbin/stop-slaves.sh - Stops all slave instances on the machines specified in the conf/slaves file.

  • sbin/stop-all.sh - Stops both the master and the slaves as described above.

Docker

基于 Docker 的 Spark 集群搭建

Application Submit

./bin/spark-submit \
--class <main-class>
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]

参数如下:

  • --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)

  • --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)

  • --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)

  • --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown).

  • application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.

  • application-arguments: Arguments passed to the main method of your main class, if any

StandAlone

# Run application locally on 8 cores
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar \
100
# Run on a Spark Standalone cluster in client deploy mode
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
# Run on a Spark Standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--deploy-mode cluster
--supervise
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000

在任务提交之后,在管理界面会显示:

![enter description here][3]

Program

Initializing Spark

编写 Spark 程序的首先是需要创建一个 JavaSparkContext 对象,用于确定如何去连接到一个集群中。而如果需要创建一个 SparkContext 对象则需要新创建一个包含应用的基本信息 SparkConf 对象。

SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);

上述代码中的 appName 是该应用的名称,而 master 指 Spark 进程的 URL,如果对于 StandAlone 进程既是类似于 spark://Host:Port

Resilient Distributed Datasets (RDDs)

Parallelized Collections

List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);

External Datasets

RDD Operations

注意,由于 Spark 中采用了大量的异步操作,并不能像普通的 Java 程序中一样去同步进行遍历,大量的遍历等操作是利用类似于回调的方式构造的。