Data Lakes 101: An Overview

A Data Lake is a pool of unstructured and structured data, stored as-is, without a specific purpose in mind, that can be “built on multiple technologies such as Hadoop, NoSQL, Amazon Simple Storage Service, a relational database, or various combinations thereof,” according to a white paper called What is a Data Lake and Why Has it Become Popular?

A Data Lake allows multiple points of collection and multiple points of access for large volumes of data. James Dixon, founder of Pentaho Corp, who coined the term “Data Lake” in 2010, contrasts the concept with a Data Mart:

“If you think of a Data Mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

In Data Lake vs Data Warehouse: Key Differences, Tamara Dull, Director of Emerging Technologies at SAS Institute, defines a Data Lake as “a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data.”

Dull goes on to say, “The cost of storing data is relatively low as compared to the Data Warehouse. There are two key reasons for this: First, Hadoop is open source software, so the licensing and community support is free. And second, Hadoop is designed to be installed on low-cost commodity hardware.”

Shaun Connolly, Vice President of Corporate Strategy for Hortonworks, defines a Data Lake in his blog post, Enterprise Hadoop and the Journey to a Data Lake:

“A Data Lake is characterized by three key attributes:

Collect everything. A Data Lake contains all data, both raw sources over extended periods of time as well as any processed data.
Dive in anywhere. A Data Lake enables users across multiple business units to refine, explore and enrich data on their terms.
Flexible access. A Data Lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory and other processing engines.”

A Data Lake is not a quick fix for all your problems, according to Bob Violino, author of 5 Things CIOs Need to Know About Data Lakes. He says, “You can’t buy a ready-to-use Data Lake. Vendors are marketing Data Lakes as a panacea for Big Data projects, but that’s a fallacy.” He quotes Nick Heudecker, Research Director at Gartner, who says, “Like Data Warehouses, Data Lakes are a concept, not a technology. At its core, a Data Lake is a data storage strategy.”

Data Lakes Born out of Social Media Giants

A PricewaterhouseCoopers (PwC) magazine article, Data Lakes and the Promise of Unsiloed Data, summarizes the origin of the Data Lake concept:

“The basic concepts behind Hadoop were devised by Google to meet its need for a flexible, cost-effective data processing model that could scale as data volumes grew faster than ever. Yahoo, Facebook, Netflix, and others whose business models also are based on managing enormous data volumes quickly adopted similar methods. Costs were certainly a factor, as Hadoop can be 10 to 100 times less expensive to deploy than conventional data warehousing. Another driver of adoption has been the opportunity to defer labor-intensive schema development and data cleanup until an organization has identified a clear business need. And Data Lakes are more suitable for the less-structured data these companies needed to process.”

Analyze Data Forward and Backward in Time

The Data Lake allows data to be collected for future needs before anyone knows what those needs are, which gives it tremendous potential. Data is not limited by the scope of thinking present when it is captured, but is free to answer questions we don’t yet know to ask. “Data itself is no longer restrained by initial schema decisions, and can be exploited more freely by the enterprise,” says Edd Dumbill, Vice President of Strategy at Silicon Valley Data Science, writing in The Data Lake Dream. Data blogger Martin Fowler of ThoughtWorks says in a post titled Data Lakes that “the Data Lake should contain all the data because you don’t know what people will find valuable, either today or in a couple of years time.”

Chris Campbell, BlueGranite blogger and Cloud Data Solutions Architect for Microsoft, says:

“The Data Lake retains ALL data. Not just data that is in use today but data that may be used, and even data that may never be used just because it MIGHT be used someday. Data is also kept for all time so that we can go back in time to any point to do analysis.”

Tamara Dull adds that a Data Lake’s lack of structure “gives developers and Data Scientists the ability to easily configure and reconfigure their models, queries, and apps on-the-fly.”
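Campbell's point about going back in time to any point can be illustrated with a small Python sketch. It assumes an append-only history in which every change is kept and nothing is overwritten. The products, dates, prices, and the function name `prices_as_of` are hypothetical and chosen only for illustration.

```python
from datetime import date

# Hypothetical append-only history: (date of change, product, price).
# Every change is retained; older rows are never updated or deleted.
price_history = [
    (date(2014, 1, 1), "widget", 10.0),
    (date(2014, 6, 1), "widget", 12.0),
    (date(2015, 3, 1), "widget", 11.0),
    (date(2014, 9, 1), "gadget", 30.0),
]


def prices_as_of(history, when):
    """Return the latest known price per product on or before `when`."""
    snapshot = {}
    # Replay changes in chronological order so later changes win.
    for changed_on, product, price in sorted(history, key=lambda row: row[0]):
        if changed_on <= when:
            snapshot[product] = price
    return snapshot


mid_2014 = prices_as_of(price_history, date(2014, 7, 1))
end_2015 = prices_as_of(price_history, date(2015, 12, 31))
```

Because the full history is kept, the same data answers questions about any past date. A store that held only the current price could not reconstruct the mid-2014 view.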

Supports Multiple Users

Another feature of the Data Lake approach is that it meets the needs of a variety of users. Users all over the company can access the data for whatever needs they can imagine, which moves the organization from a centralized model to a more distributed one. “The potential exists for users from different business units to refine, explore, and enrich data,” according to Putting the Data Lake to Work, a white paper by Hortonworks & Teradata.

Chris Campbell divides data users into three categories based on their relationship to the data: those who simply want a daily report on a spreadsheet, those who do more analysis but like to go back to the source to get data not originally included, and those who want to use data to answer entirely new questions. He says, “The Data Lake approach supports all of these users equally well.”

Cost-Effective Storage

Campbell also says that data in a Data Lake is relatively cheap and easy to store because storage costs are minimal and pre-formatting isn’t necessary: “Commodity, off-the-shelf servers combined with cheap storage makes scaling a Data Lake to terabytes and petabytes fairly economical.” According to Hortonworks & Teradata’s white paper, the Data Lake concept “provides a cost-effective and technologically feasible way to meet Big Data challenges.”

Beware the “Swamp”

Martin Fowler cautions that there is “a common criticism of the Data Lake – that it’s just a dumping ground for data of widely varying quality, better named a ‘data swamp.’ The criticism is both valid and irrelevant.” He goes on to say:

“The complexity of this raw data means that there is room for something that curates the data into a more manageable structure (as well as reducing the considerable volume of data.) The Data Lake shouldn’t be accessed directly very much. Because the data is raw, you need a lot of skill to make any sense of it. You have relatively few people who work in the Data Lake, as they uncover generally useful views of data in the lake, they can create a number of data marts each of which has a specific model for a single bounded context.”
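The curation step Fowler describes, where a few skilled people derive a data mart with a specific model from raw lake data, might look something like the following Python sketch. The event shapes, the field names, and the function `build_sales_mart` are assumptions made for illustration, not part of Fowler's post.

```python
from collections import defaultdict

# Hypothetical raw lake contents: mixed shapes and quality, stored as-is.
raw_events = [
    {"type": "sale", "region": "EU", "amount": "19.99"},
    {"type": "sale", "region": "US", "amount": 5},
    {"type": "click", "page": "/home"},
    {"type": "sale", "region": "EU"},  # incomplete: no amount
]


def build_sales_mart(events):
    """Curate raw events into one small model: total sales per region."""
    totals = defaultdict(float)
    for event in events:
        if event.get("type") != "sale":
            continue  # outside this mart's bounded context
        try:
            region = event["region"]
            amount = float(event["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # unusable rows are skipped here but remain in the lake
        totals[region] += amount
    return {region: round(total, 2) for region, total in totals.items()}


sales_mart = build_sales_mart(raw_events)
```

The raw events are left untouched, so a different mart with a different model can later be built from the same data.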

Varied Understanding of Data Context

End users may not know how to use data or what they’re looking at when data is not curated or structured, making it less useful: “The fundamental issue with the Data Lake is that it makes certain assumptions about the users of information,” says Nick Heudecker, in Data Lakes: Don’t Confuse Them With Data Warehouses, Warns Gartner.

Tamara Dull points out that, despite the initial desire to give everyone in the company access to data, expecting across-the-board participation may lead to disappointment, as it did with previous initiatives:

“For a long time, the rallying cry has been, ‘BI and Analytics for everyone!’ We’ve built the data warehouse and invited ‘everyone’ to come, but have they come? On average, 20-25% of them have. Is it the same cry for the Data Lake? Will we build the Data Lake and invite everyone to come? Not if you’re smart. Trust me, a Data Lake, at this point in its maturity, is best suited for the data scientists.”

Are Data Lakes Better than Data Warehouses?

Tamara Dull notes that a Data Lake is not “Data Warehouse 2.0,” nor is it a replacement for the Data Warehouse: “So to answer the question—Isn’t a Data Lake just the data warehouse revisited?—my take is no.” John Morrell, Senior Director of Product Marketing at Datameer, has also made a number of important points about Data Lakes. These discussions are paraphrased below.

Data Warehouse vs. Data Lake

Data in the warehouse is: Structured, processed
Processing for the warehouse is: Schema-on-write
Storage in the warehouse is: Expensive for large data volumes
Agility in the warehouse is: Less agile, fixed configuration
Security in the warehouse is: Mature
Users for the warehouse are: Business professionals

Data in the lake is: Structured/semi-structured/unstructured/raw
Processing for the lake is: Schema-on-read
Storage in the lake is: Designed for low-cost
Agility in the lake is: Highly agile, configure and reconfigure as needed
Security in the lake is: Maturing
Users for the lake are: Data Scientists, et al.
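The schema-on-write versus schema-on-read distinction in the comparison above can be illustrated with a minimal Python sketch. The record fields (`user`, `amount`, `device`), the function names, and the in-memory lists standing in for storage are all assumptions made for illustration. Nothing here reflects a particular warehouse or lake product.

```python
import json

# Hypothetical warehouse schema: field name -> required type.
SCHEMA = {"user": str, "amount": float}


def warehouse_write(table, record):
    """Schema-on-write: validate against the schema before storing."""
    for field, expected in SCHEMA.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"rejected on write: bad or missing {field!r}")
    table.append({field: record[field] for field in SCHEMA})


def lake_write(store, raw_line):
    """Schema-on-read: store the raw text as-is, with no validation."""
    store.append(raw_line)


def lake_read(store, schema):
    """Apply a schema only when reading; skip rows that do not fit it."""
    rows = []
    for raw_line in store:
        try:
            record = json.loads(raw_line)
            rows.append({f: cast(record[f]) for f, cast in schema.items()})
        except (ValueError, KeyError, TypeError):
            continue  # the raw line stays in the lake; this reader ignores it
    return rows


warehouse, lake = [], []
raw_lines = [
    '{"user": "ann", "amount": 12.5}',
    '{"user": "bob", "amount": "7", "device": "mobile"}',
    'not json at all',
]

for line in raw_lines:
    lake_write(lake, line)  # the lake accepts everything
    try:
        warehouse_write(warehouse, json.loads(line))
    except ValueError:
        pass  # the warehouse turns away anything that does not fit

# Two different schemas applied to the same raw data at read time.
totals = lake_read(lake, {"user": str, "amount": float})
devices = lake_read(lake, {"user": str, "device": str})
```

In this sketch the warehouse keeps only the one record that matched its schema at load time, while the lake keeps all three lines. Each reader then decides what structure to impose, which is why the same raw data can serve the `devices` question even though no one planned for it when the data was written.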

Chris Campbell sees these key differences between the two:

Data Warehouse:
It represents an abstracted picture of the business organized by subject area.
It is highly transformed and structured.
Data is not loaded to the data warehouse until the use for it has been defined.
It generally follows an established methodology.

Data Lake:
All data is loaded from source systems. No data is turned away.
Data is stored at the leaf level in an untransformed or nearly untransformed state.
Data is transformed and schema is applied to fulfill the needs of analysis.
It supports all users.
It adapts easily to changes and provides faster insights.

Although each has its proponents and detractors, it appears that there is room for both: “A Data Lake is not a Data Warehouse. They are both optimized for different purposes, and the goal is to use each one for what they were designed to do,” says Tamara Dull. “Or in other words, use the best tool for the job. This is not a new lesson. We’ve learned this one before. Now let’s do it.”
