Should your organization implement a data lake or data warehouse? As Kelle O’Neal, CEO and founder of First San Francisco Partners, pointed out during her presentation at the DATAVERSITY® 2017 Enterprise Analytics Online Conference, you don’t need to choose just one: Many organizations use both a data lake and data warehouse.
That’s because data lakes and data warehouses essentially complement each other, said O’Neal: “We talk about it as data lake vs. data warehouse, but the reality is that it’s not a versus. It’s not an either/or. There are many shades of grey in between. You can start with a data warehouse and blend it into a data lake, or you can start with a data lake and blend it into a data warehouse.”
The key to success, said O’Neal, is to understand how and when to use each one based on your organization’s requirements and ability to fulfill those requirements.
How Does a Data Lake Differ from a Data Warehouse?
O’Neal offered two definitions to get everyone on the same page. IT research company Gartner, she said, describes a data warehouse as “a storage architecture that combines data in an aggregate, summary form suitable for enterprise-wide data analysis and reporting for predefined business needs,” whereas data assets in a data lake are “stored in a near-exact, or even exact, copy of the source format.”
Pentaho CTO James Dixon, who coined the term “data lake,” once used this analogy: “If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” (You can substitute “data warehouse” for “data mart” in this context, noted O’Neal.)
Data lakes are not interchangeable with data warehouses. Both can be used to store enterprise data, but they differ in a number of critical ways:
- Data type: Data warehouses contain only structured data required to answer a certain set of questions, whereas data lakes can handle all types of data, including structured, semi-structured, and raw, making them naturally more flexible. “Data lakes are designed for more fluid environments in which some of the questions are known, but many are not,” said O’Neal. “This provides the opportunity to leverage a broad and vast volume of data in potentially new and novel ways.”
- Processing and agility: In data warehouses, the structure of data is applied as it is loaded in (schema-on-write), while the structure of data in data lakes is applied as it is pulled out (schema-on-read). “The implication with schema-on-read is that you can apply multiple structures or lenses, depending on the purpose of the inquirer,” said O’Neal. “The requirements for structure in a data warehouse, such as schema-on-write, make it more fixed and less agile.”
- Users: Data warehouses are ideal for business users with a “specific viewpoint they are trying to understand,” said O’Neal, adding that you can build a data mart on top of a data warehouse to “provide an even more specific user community with a subset of that aggregated data.” More experimental-friendly data lakes call for more experienced users to reap their benefits.
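To make the schema-on-write versus schema-on-read distinction concrete, here is a minimal Python sketch. It is an illustration only, not something shown in the presentation: the sample events, the field names, and the helper functions (`load_into_warehouse`, `load_into_lake`, `read_with_schema`) are all hypothetical.

```python
import json

# Hypothetical raw events as they might arrive from a source system.
RAW_EVENTS = [
    '{"customer": "a1", "amount": "19.99", "channel": "web"}',
    '{"customer": "b2", "amount": "5.00", "coupon": "SPRING"}',
]


def load_into_warehouse(raw_lines):
    """Schema-on-write: parse and shape every record before storing it.

    Only the predefined columns (customer, amount) survive the load;
    anything else in the source record is dropped.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append((record["customer"], float(record["amount"])))
    return rows


def load_into_lake(raw_lines):
    """Schema-on-read: store the records exactly as they arrived."""
    return list(raw_lines)


def read_with_schema(lake, fields):
    """Apply a caller-chosen structure (a 'lens') at read time.

    Fields missing from a record come back as None.
    """
    return [
        {field: json.loads(line).get(field) for field in fields}
        for line in lake
    ]


warehouse = load_into_warehouse(RAW_EVENTS)
lake = load_into_lake(RAW_EVENTS)

# Two different lenses over the same raw data.
sales_view = read_with_schema(lake, ["customer", "amount"])
marketing_view = read_with_schema(lake, ["customer", "channel", "coupon"])
```

In this sketch the warehouse load decides up front which fields matter, so the `channel` and `coupon` values never reach it. The lake keeps everything, and each reader chooses its own structure later.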
Challenges of Data Lakes and Data Warehouses
You might already be familiar with some of the drawbacks of data warehouses, since they’ve been around for decades: hefty expenses for large data volumes, limited scalability, and difficulty adding new data and subjects. “Because the structure in a data warehouse is more fixed, it can be incredibly time-consuming to change that structure to add new data elements, data types, and data sources,” explained O’Neal.
“We had one client whose biggest challenge was the lead time to add a new data type into a data warehouse. Nine months! If you have a specific question or you need to do a certain analysis, by the time nine months comes around, that might actually be a moot point because either the business has moved on, your competitors have outsold you, or something else has changed.”
To optimize the data warehouse, focus on supporting self-service, boosting consistency and understanding of performance, ensuring Data Quality, and maintaining Data Governance. You can also address the limitations of a data warehouse by leveraging a data lake. Some organizations, for instance, are replacing their data warehouse staging area – the place where relatively unfiltered data lands before going into the warehouse – with a data lake, in order to “have the opportunity for a raw sort of data lake environment but still take advantage of some of the structured reporting and analysis that can come out of the data warehouse,” said O’Neal.
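The staging-area pattern described above might look roughly like the following Python sketch. It rests on assumptions made purely for illustration: the lake is a local directory of JSON files, the warehouse is an in-memory SQLite database, and the `bookings` table, file names, and helper functions are hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def land_raw(lake_dir, name, payload):
    """Write source data into the lake exactly as received."""
    path = Path(lake_dir) / name
    path.write_text(payload, encoding="utf-8")
    return path


def load_warehouse(lake_dir, conn):
    """Read raw files from the lake and load a structured subset."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bookings (id TEXT PRIMARY KEY, price REAL)"
    )
    loaded = 0
    for path in sorted(Path(lake_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        # Only records that fit the reporting schema are loaded;
        # everything else stays available in the lake.
        if "id" in record and "price" in record:
            conn.execute(
                "INSERT OR REPLACE INTO bookings VALUES (?, ?)",
                (record["id"], float(record["price"])),
            )
            loaded += 1
    conn.commit()
    return loaded


with tempfile.TemporaryDirectory() as lake_dir:
    land_raw(lake_dir, "001.json",
             '{"id": "bk-1", "price": "120.50", "notes": "window seat"}')
    land_raw(lake_dir, "002.json", '{"clickstream": ["home", "search"]}')

    conn = sqlite3.connect(":memory:")
    loaded_count = load_warehouse(lake_dir, conn)
    raw_file_count = len(list(Path(lake_dir).glob("*.json")))
    total_price = conn.execute("SELECT SUM(price) FROM bookings").fetchone()[0]
    conn.close()
```

Both files remain in the lake in their raw form, while only the record that fits the reporting schema reaches the warehouse table. That split is the point of the pattern: the lake keeps the unfiltered data, and the warehouse serves structured reporting.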
That’s not to say data lakes don’t have issues of their own. O’Neal warned that they lack built-in governance, can collect clutter and become junk drawers, are not as secure as data warehouses, and tend to be resource-intensive, because “not many people understand how to implement data lakes and the technologies needed to do so.”
How can you overcome the challenges of a data lake? O’Neal shared a case study of a large global organization in the travel industry that, given the opportunity to build its infrastructure anew, established a data lake as its primary data repository. The organization took several bold measures, including launching a governance program focused on the data lake, improving data understanding through a business glossary and better metadata, and providing a “warehouse-esque environment within the data lake to accommodate those users who were more comfortable with a structured format,” said O’Neal.
That last strategy is especially smart, said O’Neal, as it’s unlikely you’ll be hiring a whole new team of data scientists for your data lake: By her estimate, there are at least 20 job openings for every data scientist on the market. She instead recommended helping your current user community embrace “all of that data at their fingertips.”
“You need to think about how you can take your existing staff and your existing skillset and upskill them to ensure that you have the resources to take advantage of the data lake. And then, as you increase the skills of each of those people, you’re going to want to put on some sort of golden handcuff so that they don’t leave in order to take advantage of that disequilibrium on the market.”
Chances are, there’s a place for both data lakes and data warehouses in your organization. “The data lake did not kill the data warehouse,” joked O’Neal, referencing the popular ’80s song “Video Killed the Radio Star.” Where one falls short, the other can fill in the gaps. Just remember to work with what you already have: “You want to make sure you are enabling people over time, not creating something that’s so complex that it’s actually not usable for your existing organization,” said O’Neal. “You want to make sure that you’re growing into the organization of the future.”