Running Spark on Alluxio

This guide describes how to run Apache Spark on Alluxio. HDFS is used as an example of a distributed under storage system. Note that Alluxio supports many other under storage systems in addition to HDFS and enables frameworks like Spark to read data from or write data to any number of those systems.


Alluxio works with Spark 1.1 or later out of the box.


General Setup

  • An Alluxio cluster has been set up in accordance with these guides for either Local Mode or Cluster Mode.

  • The Alluxio client must be compiled with the Spark-specific profile. Build the entire project from the top-level alluxio directory with the following command:

$ mvn clean package -Pspark -DskipTests
  • Add the following lines to spark/conf/spark-defaults.conf:
spark.driver.extraClassPath /<PATH_TO_ALLUXIO>/client/default/alluxio-1.6.0-SNAPSHOT-default-client.jar
spark.executor.extraClassPath /<PATH_TO_ALLUXIO>/client/default/alluxio-1.6.0-SNAPSHOT-default-client.jar
  • Advanced users can instead compile this client jar from source: follow the instructions here, then use the generated jar at /<PATH_TO_ALLUXIO>/core/client/runtime/target/alluxio-core-client-runtime-1.6.0-SNAPSHOT-jar-with-dependencies.jar for the rest of this guide.
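The spark-defaults.conf step above can be scripted. The following is a minimal sketch, not an official installer: the ALLUXIO_HOME default, the SPARK_CONF_DIR fallback to a temporary directory, and the add_conf helper are all assumptions made for illustration. It appends each classpath entry only if the key is not already present, so it can be re-run safely.

```shell
#!/usr/bin/env bash
# Sketch: register the Alluxio client jar with the Spark driver and executors.
# ALLUXIO_HOME is an assumed install location; SPARK_CONF_DIR falls back to a
# temporary directory so the sketch can be tried without touching a real install.
ALLUXIO_HOME="${ALLUXIO_HOME:-/opt/alluxio}"
SPARK_CONF_DIR="${SPARK_CONF_DIR:-$(mktemp -d)}"
CLIENT_JAR="$ALLUXIO_HOME/client/default/alluxio-1.6.0-SNAPSHOT-default-client.jar"
CONF_FILE="$SPARK_CONF_DIR/spark-defaults.conf"

# add_conf (hypothetical helper): append "key value" unless the key already exists.
add_conf() {
  local key="$1" value="$2"
  touch "$CONF_FILE"
  if ! grep -q "^${key}[[:space:]]" "$CONF_FILE"; then
    printf '%s %s\n' "$key" "$value" >> "$CONF_FILE"
  fi
}

add_conf spark.driver.extraClassPath "$CLIENT_JAR"
add_conf spark.executor.extraClassPath "$CLIENT_JAR"
cat "$CONF_FILE"
```

Point SPARK_CONF_DIR at your real spark/conf directory once the output looks right.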

Additional Setup for HDFS

  • If Alluxio is run on top of a Hadoop 1.x cluster, create a new file spark/conf/core-site.xml with the following content:
  • If you are running Alluxio in fault tolerant mode with ZooKeeper and the Hadoop cluster is version 1.x, add the following additional entry to the previously created spark/conf/core-site.xml:

and the following line to spark/conf/spark-defaults.conf:

spark.driver.extraJavaOptions -Dalluxio.zookeeper.address=zookeeperHost1:2181,zookeeperHost2:2181 -Dalluxio.zookeeper.enabled=true
spark.executor.extraJavaOptions -Dalluxio.zookeeper.address=zookeeperHost1:2181,zookeeperHost2:2181 -Dalluxio.zookeeper.enabled=true
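When the ZooKeeper quorum has more than a couple of hosts, it can help to generate these two lines rather than type them. The sketch below assumes a hypothetical helper named zk_java_opts and reuses the placeholder host names and port from the lines above; only the two -D options shown there are emitted.

```shell
#!/usr/bin/env bash
# Sketch: build the ZooKeeper-related Java options from a port and a host list.
# zk_java_opts is a hypothetical helper; hosts and port are the placeholders
# used in the example above.
zk_java_opts() {
  local port="$1"; shift
  local addrs="" host
  for host in "$@"; do
    # Join entries as host:port, comma-separated, with no leading comma.
    addrs="${addrs:+$addrs,}${host}:${port}"
  done
  printf '%s\n' "-Dalluxio.zookeeper.address=${addrs} -Dalluxio.zookeeper.enabled=true"
}

opts="$(zk_java_opts 2181 zookeeperHost1 zookeeperHost2)"
echo "spark.driver.extraJavaOptions $opts"
echo "spark.executor.extraJavaOptions $opts"
```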

Use Alluxio as Input and Output

This section shows how to use Alluxio as input and output sources for your Spark applications.

Use Data Already in Alluxio

First, we will copy some local data to the Alluxio file system. Put the file LICENSE into Alluxio, assuming you are in the Alluxio project directory:

$ bin/alluxio fs copyFromLocal LICENSE /LICENSE

Run the following commands from spark-shell, assuming Alluxio Master is running on localhost:

> val s = sc.textFile("alluxio://localhost:19998/LICENSE")
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio://localhost:19998/LICENSE2")

Open your browser and check http://localhost:19999/browse. There should be an output file LICENSE2 which doubles each line in the file LICENSE.

Use Data from HDFS

Alluxio can transparently fetch data from the under storage system, given the exact path. Put a file LICENSE into HDFS under the folder that Alluxio is mounted to. By default this is /alluxio, so any file in HDFS under this folder is discoverable by Alluxio. You can modify this setting by changing the ALLUXIO_UNDERFS_ADDRESS property on the server.

Assuming the namenode is running on localhost and you are using the default mount directory /alluxio:

$ hadoop fs -put -f /alluxio/LICENSE hdfs://localhost:9000/alluxio/LICENSE

Note that Alluxio has no notion of the file. You can verify this by going to the web UI. Run the following commands from spark-shell, assuming Alluxio Master is running on localhost:

> val s = sc.textFile("alluxio://localhost:19998/LICENSE")
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio://localhost:19998/LICENSE2")

Open your browser and check http://localhost:19999/browse. There should be an output file LICENSE2 which doubles each line in the file LICENSE. Also, the LICENSE file now appears in the Alluxio file system space.

NOTE: Block caching on partial reads is enabled by default, but if you have turned off the option, it is possible that the LICENSE file is not in Alluxio storage. This is because Alluxio only stores fully read blocks, and if the file is too small, the Spark job will have each executor read a partial block. To avoid this behavior, you can specify the partition count in Spark. For this example, we would set it to 1 as there is only 1 block.

> val s = sc.textFile("alluxio://localhost:19998/LICENSE", 1)
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio://localhost:19998/LICENSE2")

Using Fault Tolerant Mode

When running Alluxio in fault tolerant mode, you can point to any Alluxio master:

> val s = sc.textFile("alluxio-ft://standbyHost:19998/LICENSE")
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio-ft://activeHost:19998/LICENSE2")

Data Locality

If Spark task locality is ANY when it should be NODE_LOCAL, the likely cause is that Alluxio and Spark use different network address representations, for example one uses hostnames while the other uses IP addresses. Please refer to this JIRA ticket for more details, where you can find solutions from the Spark community.

Note: Alluxio uses hostnames to represent network addresses, except in version 0.7.1, where IP addresses are used. Spark v1.5.x ships with Alluxio v0.7.1 by default; in that case Spark and Alluxio both use IP addresses, so data locality should work out of the box. Since release 0.8.0, to be consistent with HDFS, Alluxio represents network addresses by hostname.

As a workaround to achieve data locality, you can explicitly specify hostnames when launching Spark by using the following script offered in Spark. Start the Spark Worker on each slave node with its slave-hostname:

$ $SPARK_HOME/sbin/ -h <slave-hostname> <spark master uri>

For example:

$ $SPARK_HOME/sbin/ -h simple30 spark://simple27:7077

You can also set the SPARK_LOCAL_HOSTNAME in $SPARK_HOME/conf/ to achieve this. For example:
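A minimal sketch of this setting follows. It assumes the target file is spark-env.sh in Spark's conf directory (the file name is not given above, so treat it as an assumption), uses a hypothetical WORKER_HOSTNAME variable, and borrows the example host simple30 from earlier in this section. SPARK_CONF_DIR falls back to a temporary directory so the sketch can be tried safely.

```shell
#!/usr/bin/env bash
# Sketch: pin the hostname that a Spark worker advertises.
# Assumption: the environment file is spark-env.sh under Spark's conf directory.
SPARK_CONF_DIR="${SPARK_CONF_DIR:-$(mktemp -d)}"
ENV_FILE="$SPARK_CONF_DIR/spark-env.sh"
# WORKER_HOSTNAME is a hypothetical variable; simple30 is the example host from the text.
WORKER_HOSTNAME="${WORKER_HOSTNAME:-simple30}"

# Drop any earlier SPARK_LOCAL_HOSTNAME line, then append the new value.
touch "$ENV_FILE"
grep -v '^export SPARK_LOCAL_HOSTNAME=' "$ENV_FILE" > "$ENV_FILE.tmp" || true
mv "$ENV_FILE.tmp" "$ENV_FILE"
echo "export SPARK_LOCAL_HOSTNAME=$WORKER_HOSTNAME" >> "$ENV_FILE"
cat "$ENV_FILE"
```

Each worker node needs its own hostname here, so run this per node rather than copying one file across the cluster.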


Either way, the Spark Worker addresses become hostnames and the Locality Level becomes NODE_LOCAL, as shown in the Spark web UI.



Running Spark on YARN

To maximize the amount of locality your Spark jobs attain, you should use as many executors as possible, ideally at least one executor per node. As with all methods of Alluxio deployment, there should also be an Alluxio worker on all computation nodes.

When a Spark job is run on YARN, Spark launches its executors without taking data locality into account. Spark will then correctly take data locality into account when deciding how to distribute tasks to its executors. For example, if host1 contains blockA and a job using blockA is launched on the YARN cluster with --num-executors=1, Spark might place the only executor on host2 and have poor locality. However, if --num-executors=2 and executors are started on host1 and host2, Spark will be smart enough to prioritize placing the job on host1.

"Failed to login" Issues with Spark Shell

To run the spark-shell with the Alluxio client, the Alluxio client jar must be added to the classpath of the Spark driver and Spark executors, as described earlier. However, Alluxio sometimes fails to determine the security user, which results in an error message similar to "Failed to login: No Alluxio User is found". Here are some solutions.

[Recommended] Configure Spark parameter spark.sql.hive.metastore.sharedPrefixes

This is the recommended solution for this issue.

In Spark 1.4.0 and later, Spark uses an isolated classloader to load Java classes for accessing the Hive metastore. However, the isolated classloader ignores certain packages and allows the main classloader to load “shared” classes (the Hadoop HDFS client is one of these “shared” classes). The Alluxio client should also be loaded by the main classloader, and you can append the alluxio package to the configuration parameter spark.sql.hive.metastore.sharedPrefixes to tell Spark to load Alluxio with the main classloader. For example, the parameter may be set to:
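As a hedged sketch of such a value: the list below starts from the default that the Spark configuration documentation gives for this parameter (an assumption to verify against your Spark version) and appends the alluxio package. The variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Sketch: append the alluxio package to Spark's shared-prefix list.
# DEFAULT_PREFIXES is the default documented by Spark for
# spark.sql.hive.metastore.sharedPrefixes; check it for your Spark version.
DEFAULT_PREFIXES="com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc"
SHARED_PREFIXES="${DEFAULT_PREFIXES},alluxio"

# Print the line in spark-defaults.conf form.
echo "spark.sql.hive.metastore.sharedPrefixes $SHARED_PREFIXES"
```

The printed line can be placed in spark-defaults.conf, or the same value can be passed to spark-shell with --conf.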


[Workaround] Specify fs.alluxio.impl for Hadoop Configuration

If the recommended solution described above is infeasible, this is a workaround which can also solve this issue.

Specifying the Hadoop configuration fs.alluxio.impl may also help in resolving this error. fs.alluxio.impl should be set to alluxio.hadoop.FileSystem. There are a few ways to set this parameter.

Update hadoopConfiguration in SparkContext

You can update the Hadoop configuration in the SparkContext by:

sc.hadoopConfiguration.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem")

This should be done early in your spark-shell session, before any Alluxio operations.

Update Hadoop Configuration Files

You can also add the properties to Hadoop’s configuration files, and point Spark to the Hadoop configuration files. The following should be added to Hadoop’s core-site.xml.
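A minimal sketch of such an entry, using only the fs.alluxio.impl value stated earlier in this section. The script writes a standalone core-site.xml into a temporary directory (OUT_DIR is a hypothetical variable) so it never overwrites an existing Hadoop configuration; in practice you would merge the property element into your real file.

```shell
#!/usr/bin/env bash
# Sketch: write a core-site.xml that maps the alluxio:// scheme to the Alluxio client.
# OUT_DIR is a scratch directory chosen for illustration; merge the <property>
# element into your actual core-site.xml by hand.
OUT_DIR="$(mktemp -d)"
cat > "$OUT_DIR/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.alluxio.impl</name>
    <value>alluxio.hadoop.FileSystem</value>
  </property>
</configuration>
EOF
cat "$OUT_DIR/core-site.xml"
```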

You can point Spark to the Hadoop configuration files by setting HADOOP_CONF_DIR in


To use fault tolerant mode, set the Alluxio cluster properties appropriately in a configuration file that is on the classpath.


Alternatively, you can add the properties to the Hadoop core-site.xml configuration, which is then propagated to Alluxio.
