Unifying Big Data Workloads

Try querying Big Data sets of high volume and variety that are spread across multiple independent storage systems, and you’ll find a tangled web, a Tower of Babel where platforms communicate in different languages. Then ask for speedy manipulation of that data, and the task seems almost impossible. This is the challenge many organizations face today.

Businesses and organizations have found that moving between constantly changing data platforms is costly in both time and money. That cost has pushed them toward a resolution: an abstraction layer in the middle that allows compute to talk to the data without worrying about where the data is stored.

In a recent DATAVERSITY® interview, Dipti Borkar, the VP at open source distributed storage company Alluxio, discussed the issue many organizations of all sizes are having with Big Data, disparate systems, and increased computing requirements. Throughout Borkar’s career, transitioning through many different companies until her recent landing at Alluxio, she “moved from one data platform or database to another, which was supposed to be better, cheaper, and faster for a range of applications.” This occurred in a time when data was going through a massive compute and storage explosion, as Borkar put it:

“Structured data, the predominant data type that existed early on, needed to be accessed along with the unstructured and semi-structured data of a more current generation. I learned this from my focus on search and semi-structured data. This strong interest drove me to develop a JSON database that I ran and built for web applications and mobile applications. Then I moved onto GPU analytics to look at the machine learning. All the while changing data platforms.”

Borkar felt a growing need for an easier way to work with Big Data, without a lot of back and forth. She notes that many people could get by when the ecosystem consisted of a simple stack: just HDFS, the Hadoop Distributed File System, and MapReduce, the software often used for processing Big Data sets. But as more people collected data in Hadoop, which combined compute and storage, many data frameworks sprouted, like Spark, Presto, Flink, and even Kafka. On the AI side, Borkar saw TensorFlow, Caffe, PyTorch, and a range of other frameworks emerge. The result was a massive muddle. She remarked:

“Every compute structure wanted to access data from the storage tiers, which themselves were complex due to HDFS. Multiple HDFS clusters and possible object stores became more popular. S3 [Amazon’s object storage service] also stored data. In addition to connecting to a range of storage systems, compute needed to link up with the legacy Network File Systems (NFS). Connecting every storage tier to every compute framework becomes quite complicated and expensive.”
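
The cost of connecting every storage tier to every compute framework can be illustrated with simple counting. The sketch below is only an illustration: the function names are hypothetical, and the framework and storage lists are just the examples named in this article. It compares the number of point-to-point integrations with the number needed when everything connects once to a shared middle layer.

```python
def point_to_point_connectors(frameworks, stores):
    """Each framework integrates with each store directly: N x M links."""
    return len(frameworks) * len(stores)


def middle_layer_connectors(frameworks, stores):
    """Each framework and each store integrates once with a shared layer: N + M links."""
    return len(frameworks) + len(stores)


# Example systems taken from the article; the lists are illustrative, not exhaustive.
frameworks = ["Spark", "Presto", "Flink", "TensorFlow", "PyTorch"]
stores = ["HDFS", "S3", "NFS"]

direct = point_to_point_connectors(frameworks, stores)
shared = middle_layer_connectors(frameworks, stores)
print(f"direct integrations: {direct}, via a middle layer: {shared}")
```

With five frameworks and three storage systems the gap is already 15 versus 8, and it widens as either list grows.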

Unifying Storage and Compute through a Middle Layer

According to Borkar, unifying storage and compute started by creating an abstraction on top of Hadoop or on top of object stores. She thought Alluxio’s open source platform (often thought of as All User Experience is I/O [input/output]) had found quite an interesting solution. It originated in CEO Haoyuan Li’s PhD thesis work, called “Project Tachyon,” which created an abstraction layer in the middle that co-located storage and compute.

The middleware was challenging to build because the algorithms that compute on the data and the locations that hold that data need very different environments. Borkar noted:

“Compute needs to be scaled differently with different metrics compared with storage. Compute is CPU-bound; however, storage is I/O-bound. As a result of this difference, they need to grow and scale quite independently. Algorithms that compute just need to communicate with the data; they do not really care where the data lives. The abstract middle layer does that.”

Borkar understands that companies want to move to hybrid and multi-cloud environments, which makes deployments more difficult because data must be migrated to and stored in new environments, especially where some data no longer lives locally. The number of programming frameworks needing access to the data has increased as well.

Bridging storage and compute with a virtual distributed file system is a difficult problem to solve. Borkar explained that this open source middle layer can default to being memory-based, in which case data does not persist. Companies can also set up the abstraction layer to persist the data while the different file systems and object stores living underneath are mounted, or connected, to the layer. The requested folders, or buckets in the case of object stores, are then presented to the compute on top. Alluxio’s intermediary layer reduces storage overhead and enhances elastic computing.
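
The idea of mounting different stores under one layer can be sketched as a small mount table. This is a toy model, not Alluxio’s API: the `VirtualNamespace` class, its methods, and the example URIs (`namenode:8020`, `example-bucket`) are all hypothetical names chosen for illustration.

```python
class VirtualNamespace:
    """Toy mount table mapping virtual paths to underlying storage URIs."""

    def __init__(self):
        self._mounts = {}  # virtual mount point -> backing URI prefix

    def mount(self, mount_point, backing_uri):
        # Normalize trailing slashes so path joins stay predictable.
        self._mounts[mount_point.rstrip("/")] = backing_uri.rstrip("/")

    def resolve(self, virtual_path):
        # The longest matching mount point wins, so nested mounts work.
        for mount_point in sorted(self._mounts, key=len, reverse=True):
            if virtual_path == mount_point or virtual_path.startswith(mount_point + "/"):
                return self._mounts[mount_point] + virtual_path[len(mount_point):]
        raise FileNotFoundError(f"no mount covers {virtual_path}")


ns = VirtualNamespace()
ns.mount("/warehouse", "hdfs://namenode:8020/warehouse")  # hypothetical HDFS cluster
ns.mount("/logs", "s3://example-bucket/logs")             # hypothetical bucket

print(ns.resolve("/warehouse/sales/part-0.parquet"))
print(ns.resolve("/logs/2019/01.json"))
```

Compute code only ever sees paths such as `/logs/2019/01.json`; which storage system actually holds the bytes is a detail of the mount table.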

Reducing Storage Overhead through Abstraction

Borkar further described the intermediate layer, sitting above storage, as reducing memory overhead and overload. Once data is accessed, it stays within the platform and can be used by other algorithms. She stated that with this technology:

“Essentially you automatically create a virtual data lake across multiple storage systems that unifies with a single namespace. All file paths look the same and fiddling between major applications is not needed. That is the in-between layer. If a Spark job has to access HDFS, it can do so through Alluxio to the popular compute frameworks on top through APIs. A Presto job using the same dataset can just go back to that in-between layer to find what it needs.”

Borkar emphasized that the in-between layer brings together all the data in use. Once that subset of data is gathered in one place and easily accessible, multiple frameworks can reuse it, which improves performance.
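
The reuse described here behaves much like a read-through cache shared by several jobs. The following is a minimal in-memory sketch under that assumption; `ReadThroughLayer`, the `backing_store` dictionary, and the file path are hypothetical stand-ins, and a real system would also handle eviction, persistence, and consistency.

```python
class ReadThroughLayer:
    """Toy shared layer: fetch from the backing store once, then serve from memory."""

    def __init__(self, fetch_from_store):
        self._fetch = fetch_from_store  # callable: path -> bytes
        self._cache = {}
        self.remote_reads = 0           # how often the backing store was hit

    def read(self, path):
        if path not in self._cache:
            self.remote_reads += 1
            self._cache[path] = self._fetch(path)
        return self._cache[path]


# Hypothetical backing store standing in for HDFS or an object store.
backing_store = {"/warehouse/events.csv": b"id,ts\n1,100\n"}
layer = ReadThroughLayer(backing_store.__getitem__)

spark_view = layer.read("/warehouse/events.csv")   # first job: fetched remotely
presto_view = layer.read("/warehouse/events.csv")  # second job: served from the layer

print(layer.remote_reads)
```

Both jobs receive the same bytes, but the backing store is contacted only once, which is the effect the shared layer is meant to provide.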

Co-locating Data for Easy Compute

Computing frameworks work with data regardless of where it is deployed. Borkar used the example of a hybrid cloud, where the data lives on-premises but is manipulated in Azure, AWS, or another cloud environment. She said:

“At the end of the day the data needs to be co-located with the computing functions. Traditionally, a lot of data engineering had to be done using Extract, Transform, Load (ETL), which consumes time. With Alluxio, you mount the folder or working set from HDFS or any object store and link it back to the middleware; as requests for that data set or object store are sent by the programs, the in-between layer will go and pull that data back and reposition it with your compute.”

This becomes an excellent solution for the hybrid cloud environment. Borkar believes the in-between layer not only reduces engineering and data migration work, making data easier to work with, but also gives a performance boost, because users can work at internal network speed rather than at the pace of the Wide Area Network (WAN).

Locality: A Foundation of Data Architecture

Alluxio’s middleware project has caught on as “internet companies have started to use it widely.” From its start at the AMPLab at UC Berkeley to the present, Borkar has seen widespread adoption, with more than 800 project contributors at 200 institutions. Other engineers and companies have used Alluxio’s open source code to solve parts of the problem of connecting compute and data, “like storage abstraction or data deployment through mounting data stores to different systems, i.e. Kubernetes,” said Borkar. However, in the end, organizations are left with a lot of data manipulation to do, because other open source solutions are not a data store or a virtual data lake. She explained:

“In any environment with data-driven applications, whether it is SQL or more machine learning, locality forms the foundation of Data Architecture. You must think it through, how to co-locate storage and compute. In hybrid or multi-cloud environments there isn’t a good solution right now outside of a virtual distributed file system, like Alluxio.”

For this reason, Alluxio’s middleware technology has flourished in North America and Asia, especially in China and Singapore. Eight of the ten largest Asian internet companies employ Alluxio at a massive scale, up to 1,300 nodes in the largest-scale cluster.

“This project has become an accelerator for data-driven workloads, particularly when customers need to leverage highly scalable and elastic compute environments. Increasingly this is the case with Kubernetes and Mesos. As an enabler to operate on the hybrid and cloud flexibly, we are continuing to see use cases in telecommunications, retail, and internet companies.”

This in-between layer translates between the mechanisms that hold the data and those that manipulate it: a simple solution to a Tower of Babel-like problem.

Image used under license from Shutterstock.com
