Data orchestration aims to bring order and speed to a complex Big Data ecosystem: a conglomeration of storage systems such as Amazon S3, Apache HDFS, and OpenStack Swift, and of computation frameworks and applications such as Apache Spark and Hadoop MapReduce. A proliferation of data silos has left the data stack fragmented and slow.
The technology aims to break through the “walled gardens” that today prevent applications and people from accessing data sources in any format and from any location. As businesses continue shifting to hybrid and multi-cloud architectures, and as data keeps growing, forward-compatibility across the data ecosystem grows in importance.
An open-source project, Tachyon, which came out of UC Berkeley’s AMPLab, was directed at keeping storage from becoming a bottleneck in workloads. Haoyuan Li, who co-created Apache Spark Streaming and is an Apache Spark founding Project Management Committee (PMC) member, created the distributed file system, which enables reliable data sharing at memory speed across cluster computing frameworks. Yahoo, Tachyon Nexus, Red Hat, Nokia, Intel, and Databricks were all contributors to it.
Tachyon is now known as Alluxio and is used in production to manage petabytes of data at Alibaba Cloud, Barclays, ING, Microsoft, and many other large companies. The largest deployment exceeds 1,300 nodes. Li is now CTO of the company behind the project, also named Alluxio.
The Move to Cloud and Cloud Analytics
Storage systems have dominated the past decade, Alluxio CEO Steven Mih said in a recent DATAVERSITY® interview, but the industry is now moving to the cloud and to cloud analytics systems. Data orchestration is critical for moving data from different systems to the new frameworks an organization wants to use.
“The digital transformation is stuck in second gear,” said Mih. For data-driven digital transformation, data needs to be quickly available to analytics systems. But when data is distributed over multiple data centers or clouds, it’s likely that a query would need to transfer data from one place to another, causing huge delays.
Alluxio sits between compute and storage and provides a single point of data access and integration. The data orchestration solution isn’t trying to get rid of data silos, but rather to “embrace the chaos,” as Mih put it. “Let applications that need the data be able to have a system that pulls that to them. That will be the world of hybrid and multi-cloud.”
Data being accessed, whether it sits on a local storage system or in a public cloud, is moved into memory. On first access, the data is served at the speed of the network; on subsequent accesses, it is served at the speed of local memory or disk, because remotely accessed data is moved into the memory of the local cluster.
Data can be cached local to compute workloads such as Spark, Presto, and Hive; files and objects are accessible whether they live on-prem or in the cloud; and the setup is elastic, because the data can be orchestrated across multiple clouds.
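The access pattern described here resembles a read-through cache. The following is a minimal Python sketch of that general idea, not Alluxio's implementation: the class `ReadThroughCache`, the function `fake_remote_store`, and the example path are all hypothetical names invented for illustration.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Toy read-through cache (illustrative only, not Alluxio code).

    The first read of a path goes to the remote store; the result is kept
    in a local in-memory dict so that later reads never leave the process.
    """

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        self._fetch_remote = fetch_remote      # slow path, e.g. over the network
        self._local: Dict[str, bytes] = {}     # fast path, local memory
        self.remote_reads = 0
        self.local_hits = 0

    def read(self, path: str) -> bytes:
        if path in self._local:
            # Data was already pulled in by an earlier read.
            self.local_hits += 1
            return self._local[path]
        # First access: fetch remotely, then keep a local copy.
        data = self._fetch_remote(path)
        self.remote_reads += 1
        self._local[path] = data
        return data


def fake_remote_store(path: str) -> bytes:
    """Stand-in for a remote object store or file system (hypothetical)."""
    return f"contents of {path}".encode()


cache = ReadThroughCache(fake_remote_store)
first = cache.read("s3://example-bucket/events.parquet")   # served remotely
second = cache.read("s3://example-bucket/events.parquet")  # served from memory
print(cache.remote_reads, cache.local_hits)  # prints: 1 1
```

A real system would also need eviction, tiering between memory and disk, and consistency handling, none of which this sketch attempts.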
In its simplest form, Alluxio is a virtual file system that transparently connects to existing storage systems and presents them to users as a single system, which can help with the Data Management challenges of deep learning. Because Alluxio integrates with the underlying storage systems, deep learning frameworks only need to interact with Alluxio to access data from all of them. Training can then be performed on data from any source, which can lead to better model performance, the company says.
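One way to picture a virtual file system that presents several stores as one is a mount table that maps virtual path prefixes to backend locations. The Python sketch below illustrates that general technique under stated assumptions; it does not use Alluxio's API, and the class `UnifiedNamespace`, the mount points, and the backend URIs are hypothetical.

```python
from typing import Dict


class UnifiedNamespace:
    """Toy mount table mapping virtual paths to backend storage URIs.

    Illustrative only: a real virtual file system also handles reads,
    writes, metadata, and permissions.
    """

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, virtual_prefix: str, backend_uri: str) -> None:
        # Normalize trailing slashes so prefixes compare cleanly.
        self._mounts[virtual_prefix.rstrip("/")] = backend_uri.rstrip("/")

    def resolve(self, virtual_path: str) -> str:
        # The longest matching prefix wins, so nested mounts behave sensibly.
        for prefix in sorted(self._mounts, key=len, reverse=True):
            if virtual_path == prefix or virtual_path.startswith(prefix + "/"):
                return self._mounts[prefix] + virtual_path[len(prefix):]
        raise FileNotFoundError(f"no mount covers {virtual_path!r}")


ns = UnifiedNamespace()
ns.mount("/warehouse", "hdfs://namenode:8020/warehouse")  # hypothetical on-prem store
ns.mount("/raw", "s3://example-bucket/raw")               # hypothetical cloud store

# A training job only ever sees paths in the single virtual namespace.
print(ns.resolve("/raw/images/cat.jpg"))
print(ns.resolve("/warehouse/sales/2019.csv"))
```

The design choice worth noting is that callers never learn which backend holds a file; swapping or adding a store only changes the mount table.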
Death to Data Wrangling
No one wants to wrangle data (copy data to different data silos that could be in the cloud or elsewhere) if it can be avoided. And Alluxio helps users get beyond connecting everything together one at a time with APIs.
With a host of clustered framework systems, any time you have a new cluster you have to get the APIs to work with your data source, Mih said. “Say you have five frameworks with one data source — that’s five connectors. And if you’ve got a second data source that’s ten connectors, right?” And on and on.
Rethinking this in layers means simply plugging each new data source into the hub, that is, the central transit center. “We’re going to take an application-centered view, not a storage-centered view,” said Mih. This approach helps organizations stay compliant with data regulations and makes data available on demand.
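The connector arithmetic in Mih's example can be worked through in a few lines of Python. The point-to-point count follows directly from his numbers; the hub count is an assumption made here for illustration, namely that each framework and each data source needs exactly one integration with the hub. The function names are hypothetical.

```python
def point_to_point_connectors(frameworks: int, sources: int) -> int:
    """Every framework needs its own connector to every data source."""
    return frameworks * sources


def hub_connectors(frameworks: int, sources: int) -> int:
    """Assumed hub model: one integration per framework plus one per source."""
    return frameworks + sources


frameworks = 5
for sources in (1, 2, 3, 4):
    direct = point_to_point_connectors(frameworks, sources)
    via_hub = hub_connectors(frameworks, sources)
    print(f"{frameworks} frameworks, {sources} sources: "
          f"{direct} direct connectors vs {via_hub} via a hub")
```

With five frameworks, the direct count grows by five for every new data source, while under the assumed hub model it grows by one.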
In this respect, there’s no reason to put all the data from old platforms — which could amount to hundreds of terabytes — into the cloud all at once.
“You can just take your relevant data and put that in the cloud,” he said. “The amount of relevant data is a small percentage of your data and that’s what you really care about. That could be just three to five percent of all data overall. Data orchestration works with making what we call the ‘active site of data’ available and elastic.”
Organizations can move data gradually until they’re ready to go completely to the cloud.
“That’s the direction that people are going to,” he said. “They will migrate and most likely they’ll start from a hybrid environment and then move into a single cloud and then to a multi-cloud situation. That’s when you have multiple silos of data that are being generated based on different applications creating operational data.”
From a cost perspective, data orchestration is the lowest-cost way to run analytics, Mih said. “You have the lowest, easiest place to maintain your operations for your storage and you’ll have the operations of a scale-out system for analytics so you’re not paying for compute that you’re not using. That’s the new modern data analytics and it needs to include data orchestration.”
In July the company announced Alluxio 2, its largest open source release, intended to simplify and accelerate multi-cloud, data analytics, and AI workloads on private, public, or hybrid cloud infrastructures, making use of valuable data wherever it is stored.