Running Alluxio YARN Integration

This guide explains the process for running Alluxio as an application in a YARN cluster. For a self-contained tutorial on running Alluxio + YARN on EC2, see this guide.

Note: YARN is not well-suited for long-running applications such as Alluxio. We recommend following these instructions instead of running Alluxio as a YARN application.

Prerequisites

A running YARN cluster

Alluxio downloaded locally

$ curl http://downloads.alluxio.org/downloads/files/1.5.0/alluxio-1.5.0-bin.tar.gz | tar xz

Build YARN Integration

$ mvn clean install -Dhadoop.version=<your hadoop version> -Pyarn -Dlicense.skip -DskipTests -Dfindbugs.skip -Dmaven.javadoc.skip -Dcheckstyle.skip

Make sure to replace with the version of Hadoop that you are using.

Configuration

To customize Alluxio master and worker with specific properties (e.g., tiered storage setup on each worker), see Configuration settings. To ensure your configuration can be read by both the ApplicationMaster and Alluxio master/workers, put alluxio-site.properties in /etc/alluxio/alluxio-site.properties.

If Yarn does not reside in HADOOP_HOME, set the environment variable YARN_HOME to the base path of Yarn.

Run Alluxio Application

Use the script integration/yarn/bin/alluxio-yarn.sh to start Alluxio. This script takes three arguments:

  1. The total number of Alluxio workers to start. (required)
  2. An HDFS path to distribute the binaries for Alluxio ApplicationMaster. (required)
  3. The Yarn name for the node on which to run the Alluxio Master (optional, defaults to ${ALLUXIO_MASTER_HOSTNAME})

For example, to launch an Alluxio cluster with 3 worker nodes, where an HDFS temp directory is hdfs://masterhost:9000/tmp/ and the master hostname is masterhost, you would run

$ export HADOOP_HOME=/hadoop
$ /hadoop/bin/hadoop fs -mkdir hdfs://masterhost:9000/tmp
$ /alluxio/integration/yarn/bin/alluxio-yarn.sh 3 hdfs://masterhost:9000/tmp/ masterhost

You may also start the Alluxio Master node separately from Yarn in which case the above startup will automatically detect the Master at the address provided and skip initialization of a new instance. This is useful if you have a particular host you’d like to run the Master on, which isn’t part of your Yarn cluster, like an AWS EMR Master Instance.

The script will launch an Alluxio Application Master on Yarn, which will then request containers for the Alluxio master and workers. You can check the YARN UI in the browser to watch the status of the Alluxio job.

Running the script will produce output containing something like

INFO impl.YarnClientImpl: Submitted application application_1445469376652_0002

This application ID can be used to destroy the application by running

$ /hadoop/bin/yarn application -kill application_1445469376652_0002

The ID can also be found in the YARN web UI.

Test Alluxio

Once you have the Alluxio application running, you can check its health by configuring alluxio.master.hostname=masterhost in conf/alluxio-site.properties and running

$ /alluxio/bin/alluxio runTests
Need help? Ask a Question