Alluxio, formerly known as Tachyon, is the world’s first memory-speed virtual distributed storage system. It unifies data access and bridges computation frameworks and underlying storage systems. Applications only need to connect to Alluxio to access data stored in any underlying storage system. Additionally, Alluxio’s memory-centric architecture enables data access at speeds that are orders of magnitude faster than existing solutions.
In the big data ecosystem, Alluxio lies between computation frameworks or jobs, such as Apache Spark, Apache MapReduce, Apache HBase, Apache Hive, or Apache Flink, and various kinds of storage systems, such as Amazon S3, Google Cloud Storage, OpenStack Swift, GlusterFS, HDFS, MaprFS, Ceph, NFS, and Alibaba OSS. Alluxio brings significant performance improvements to the ecosystem. For example, Baidu uses Alluxio to speed up the throughput of its data analytics pipeline by 30 times, Barclays uses Alluxio to accelerate jobs from hours to seconds, and Qunar performs real-time data analytics on top of Alluxio. Beyond performance, Alluxio bridges new workloads with data stored in traditional storage systems. Users can run Alluxio in its standalone cluster mode, for example on Amazon EC2 or Google Compute Engine, or launch Alluxio with Apache Mesos or Apache YARN.
Alluxio is Hadoop compatible. Existing data analytics applications, such as Spark and MapReduce programs, can run on top of Alluxio without any code changes. The project is open source under the Apache License 2.0 and is deployed at many companies. It is one of the fastest growing open source projects. With three years of open source history, Alluxio has attracted more than 600 contributors from over 150 institutions, including Alibaba, Alluxio, Baidu, CMU, Google, IBM, Intel, NJU, Red Hat, UC Berkeley, and Yahoo. The project is the storage layer of the Berkeley Data Analytics Stack (BDAS) and is also part of the Fedora distribution. Today, Alluxio is deployed in production by hundreds of organizations and runs on clusters that exceed 1,000 nodes.
Flexible File API Alluxio’s native API is similar to that of the java.io.File class, providing InputStream and OutputStream interfaces and efficient support for memory-mapped I/O. We recommend using this API to get the best performance from Alluxio. Alternatively, Alluxio provides a Hadoop compatible FileSystem interface, allowing Hadoop MapReduce and Spark to use Alluxio in place of HDFS.
Pluggable Under Storage To provide fault-tolerance, Alluxio checkpoints data to the underlying storage system. It has a generic interface that makes it easy to plug in different under storage systems. We currently support Microsoft Azure Blob Store, Amazon S3, Google Cloud Storage, OpenStack Swift, GlusterFS, HDFS, MaprFS, Ceph, NFS, Alibaba OSS, Minio, and single-node local file systems, and support for many other file systems is coming.
Tiered Storage With Tiered Storage, Alluxio can manage SSDs and HDDs in addition to memory, allowing for larger datasets to be stored in Alluxio. Data will automatically be managed between the different tiers, keeping hot data in faster tiers. Custom policies are easily pluggable, and a pin concept allows for direct user control.
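The tier management described above can be pictured with a small, self-contained model. This is only an illustrative sketch, not Alluxio's implementation: the class `TieredStoreSketch`, its methods, the per-tier block counts, and the least-recently-used eviction rule are all assumptions made for the example. It shows hot data being promoted to the fastest tier, cold data being demoted, and a pinned block being skipped by eviction.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model, not Alluxio code. Tier 0 is the fastest tier
// (e.g. memory); higher indexes are slower tiers (e.g. SSD, then HDD).
final class TieredStoreSketch {
    // One access-ordered map per tier: iteration runs from least to most recently used.
    private final List<LinkedHashMap<String, byte[]>> tiers = new ArrayList<>();
    private final int[] capacity; // maximum number of blocks each tier may hold
    private final Set<String> pinned = new HashSet<>();

    TieredStoreSketch(int... blocksPerTier) {
        capacity = blocksPerTier.clone();
        for (int i = 0; i < capacity.length; i++) {
            tiers.add(new LinkedHashMap<>(16, 0.75f, true));
        }
    }

    // A pinned block is never chosen as an eviction victim.
    void pin(String blockId) { pinned.add(blockId); }

    // New or rewritten data always lands in the fastest tier.
    void put(String blockId, byte[] data) {
        for (Map<String, byte[]> tier : tiers) tier.remove(blockId);
        place(0, blockId, data);
    }

    // Reading a block promotes it back to the fastest tier ("hot" data).
    byte[] get(String blockId) {
        for (Map<String, byte[]> tier : tiers) {
            byte[] data = tier.remove(blockId);
            if (data != null) {
                place(0, blockId, data);
                return data;
            }
        }
        return null;
    }

    // Returns the index of the tier holding the block, or -1 if it is not stored.
    int tierOf(String blockId) {
        for (int i = 0; i < tiers.size(); i++) {
            if (tiers.get(i).containsKey(blockId)) return i;
        }
        return -1;
    }

    private void place(int tier, String blockId, byte[] data) {
        if (tier >= tiers.size()) return; // demoted past the slowest tier: dropped from the store
        LinkedHashMap<String, byte[]> blocks = tiers.get(tier);
        blocks.put(blockId, data);
        if (blocks.size() <= capacity[tier]) return;
        // Over capacity: demote the least recently used unpinned block by one tier.
        Iterator<Map.Entry<String, byte[]>> it = blocks.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, byte[]> entry = it.next();
            String victimId = entry.getKey();
            byte[] victimData = entry.getValue();
            if (!pinned.contains(victimId)) {
                it.remove();
                place(tier + 1, victimId, victimData);
                return;
            }
        }
        // Every block in this tier is pinned: the sketch simply lets the tier overflow.
    }

    public static void main(String[] args) {
        TieredStoreSketch store = new TieredStoreSketch(2, 2);
        store.put("a", new byte[1]);
        store.put("b", new byte[1]);
        store.put("c", new byte[1]);
        System.out.println(store.tierOf("a")); // 1: "a" was least recently used and got demoted
        store.get("a");
        System.out.println(store.tierOf("a")); // 0: reading "a" promoted it again
    }
}
```

Swapping the victim-selection loop for another rule is the kind of change a custom policy would make; the real system's policy interface is not shown here.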
Unified Namespace Alluxio enables effective data management across different storage systems through the mount feature. Furthermore, transparent naming ensures that the file names and directory hierarchy of objects created in Alluxio are preserved when persisting these objects to the underlying storage system.
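One way to picture mounting and transparent naming is a table that maps Alluxio path prefixes to under storage locations. The sketch below is a conceptual model only: `MountTableSketch`, its methods, and the example URIs are hypothetical and are not taken from the Alluxio code base. It resolves a path by picking the longest matching mount point and keeping the rest of the path unchanged, which is what preserves the file names and directory hierarchy.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of a mount table, not Alluxio code.
final class MountTableSketch {
    // Alluxio path prefix -> under storage URI prefix (both stored without a trailing slash).
    private final Map<String, String> mounts = new TreeMap<>();

    void mount(String alluxioPath, String ufsUri) {
        mounts.put(stripTrailingSlash(alluxioPath), stripTrailingSlash(ufsUri));
    }

    // Maps an Alluxio path to its under storage location using the longest matching mount point.
    String resolve(String alluxioPath) {
        String best = null;
        for (String prefix : mounts.keySet()) {
            boolean matches = alluxioPath.equals(prefix) || alluxioPath.startsWith(prefix + "/");
            if (matches && (best == null || prefix.length() > best.length())) {
                best = prefix;
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("No mount point covers " + alluxioPath);
        }
        // The path below the mount point is copied unchanged (transparent naming).
        return mounts.get(best) + alluxioPath.substring(best.length());
    }

    private static String stripTrailingSlash(String s) {
        return s.endsWith("/") ? s.substring(0, s.length() - 1) : s;
    }

    public static void main(String[] args) {
        MountTableSketch table = new MountTableSketch();
        // Example locations only; the host and bucket names are made up.
        table.mount("/", "hdfs://namenode:9000/alluxio");
        table.mount("/s3", "s3a://example-bucket/datasets");
        System.out.println(table.resolve("/logs/app.log"));   // hdfs://namenode:9000/alluxio/logs/app.log
        System.out.println(table.resolve("/s3/2016/part-0")); // s3a://example-bucket/datasets/2016/part-0
    }
}
```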
Lineage Alluxio can achieve high throughput writes without compromising fault-tolerance by using lineage, where lost output is recovered by re-executing the jobs that created the output. With lineage, applications write output into memory, and Alluxio periodically checkpoints the output into the under file system in an asynchronous fashion. In case of failures, Alluxio launches job recomputation to restore the lost files.
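The recovery idea behind lineage can be sketched in a few lines. The model below is illustrative only and makes several assumptions: `LineageSketch` and its methods are hypothetical names, a job is represented as a plain `Supplier<String>`, and checkpointing is an explicit synchronous call rather than a periodic asynchronous task. It shows that output lost before a checkpoint is restored by re-executing the job that produced it, while checkpointed output is read back from the under store.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical toy model of lineage-based recovery, not Alluxio code.
final class LineageSketch {
    private final Map<String, Supplier<String>> lineage = new HashMap<>(); // file -> job that created it
    private final Map<String, String> memory = new HashMap<>();            // fast but volatile
    private final Map<String, String> underStore = new HashMap<>();        // durable checkpoints

    // Runs the job, keeps its output in memory only, and records how to recreate it.
    void write(String file, Supplier<String> job) {
        lineage.put(file, job);
        memory.put(file, job.get());
    }

    // Stand-in for the periodic asynchronous checkpoint described above.
    void checkpoint() { underStore.putAll(memory); }

    // Simulates a failure that wipes all in-memory data.
    void loseMemory() { memory.clear(); }

    String read(String file) {
        String data = memory.get(file);
        if (data == null) data = underStore.get(file);
        if (data == null) {
            Supplier<String> job = lineage.get(file);
            if (job == null) throw new IllegalStateException("Unknown file: " + file);
            data = job.get(); // recomputation: re-execute the job that created the file
        }
        memory.put(file, data);
        return data;
    }

    public static void main(String[] args) {
        LineageSketch store = new LineageSketch();
        int[] runs = {0};
        store.write("/out/wordcount", () -> { runs[0]++; return "alluxio=3"; });
        store.loseMemory();                                // failure before any checkpoint
        System.out.println(store.read("/out/wordcount")); // alluxio=3
        System.out.println(runs[0]);                       // 2: the job ran once to write, once to recover
    }
}
```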
Web UI & Command Line Users can browse the file system easily through the web UI. Under debug mode, administrators can view detailed information about each file, including locations, checkpoint path, etc. Users can also use ./bin/alluxio fs to interact with Alluxio, e.g. to copy data in and out of the file system.
To quickly get Alluxio up and running, take a look at our Getting Started page, which walks through how to deploy Alluxio and run some basic examples in a local environment. See also the list of companies that are using Alluxio.
Downloads and More
You can get the released versions of Alluxio from the Project Downloads Page. Each release comes with prebuilt binaries compatible with various Hadoop versions. If you would like to build the project from the source code, check out the Building From Master Branch Documentation. If you have any questions, please feel free to ask on our User Mailing List. Users who cannot access the Google Group can use its mirror (note: the mirror does not have information from before May 2016).