Running Apache Flink on Alluxio

This guide describes how to get Alluxio running with Apache Flink, so that you can easily work with files stored in Alluxio.

Prerequisites

The prerequisite for this part is that you have Java. We also assume that you have set up Alluxio in accordance to these guides Local Mode or Cluster Mode.

Please find the guides for setting up Flink on the Apache Flink website.

Configuration

Apache Flink allows to use Alluxio through a generic file system wrapper for the Hadoop file system. Therefore, the configuration of Alluxio is done mostly in Hadoop configuration files.

Set property in core-site.xml

If you have a Hadoop setup next to the Flink installation, add the following property to the core-site.xml configuration file:

<property>
  <name>fs.alluxio.impl</name>
  <value>alluxio.hadoop.FileSystem</value>
</property>

In case you don’t have a Hadoop setup, you have to create a file called core-site.xml with the following contents:

<configuration>
  <property>
    <name>fs.alluxio.impl</name>
    <value>alluxio.hadoop.FileSystem</value>
  </property>
</configuration>

Next, you have to specify the path to the Hadoop configuration in Flink. Open the conf/flink-conf.yaml file in the Flink root directory and set the fs.hdfs.hadoopconf configuration value to the directory containing the core-site.xml. (For newer Hadoop versions, the directory usually ends with etc/hadoop.)

Generate and Distribute the Alluxio Client Jar

In order to communicate with Alluxio, we need to provide Flink programs with the Alluxio Core Client jar. Generate the Flink compatible client jar by building the entire project with the Flink profile from the top level alluxio directory:

mvn clean package -Pflink -DskipTests

We need to make the Alluxio jar file available to Flink, because it contains the configured alluxio.hadoop.FileSystem class.

There are different ways to achieve that:

  • Put the alluxio-core-client-1.4.0-jar-with-dependencies.jar file into the lib directory of Flink (for local and standalone cluster setups)
  • Put the alluxio-core-client-1.4.0-jar-with-dependencies.jar file into the ship directory for Flink on YARN.
  • Specify the location of the jar file in the HADOOP_CLASSPATH environment variable (make sure its available on all cluster nodes as well). For example like this:
export HADOOP_CLASSPATH=/pathToAlluxio/core/client/target/alluxio-core-client-1.4.0-jar-with-dependencies.jar

In addition, if there are any properties specified in conf/alluxio-site.properties, translate those to env.java.opts in {FLINK_HOME}/conf/flink-conf.yaml for Flink to pick up Alluxio configuration.

Using Alluxio with Flink

To use Alluxio with Flink, just specify paths with the alluxio:// scheme.

If Alluxio is installed locally, a valid path would look like this alluxio://localhost:19998/user/hduser/gutenberg.

Wordcount Example

This example assumes you have set up Alluxio and Flink as previously described.

Put the file LICENSE into Alluxio, assuming you are in the top level Alluxio project directory:

$ bin/alluxio fs copyFromLocal LICENSE alluxio://localhost:19998/LICENSE

Run the following command from the top level Flink project directory:

$ bin/flink run examples/batch/WordCount.jar --input alluxio://localhost:19998/LICENSE --output alluxio://localhost:19998/output

Open your browser and check http://localhost:19999/browse. There should be an output file output which contains the word counts of the file LICENSE.

Need help? Ask a Question