How an organization solves business problems with big data depends on the approach it takes. For example, if an organization only knows data warehouses, it will frame every challenge to fit a data warehouse. As the prominent psychologist Abraham Maslow eloquently said, “I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.” The observation applies to big data, where the data warehouse can become the hammer. But not all business data requirements are nails that a data warehouse can address, as Carolinas Healthcare System discovered. The arguments in the ongoing data warehouse vs. data lake debate are many, but when viewed through the lens of a focused Data Architecture Strategy, the choices become better defined.
In response to the dilemma of enterprises or projects with complex, diversified data spanning many different concepts, the data lake strategy has been added to the toolbox. The term data lake, coined by James Dixon of Pentaho in 2010, describes an approach that works across different data nodes. Since 2010, vendors, enterprises, and federal intelligence agencies have been using data lakes to store data that does not fit into a typical data warehouse and to add insights into security.
The buzz about data lakes shows that many businesses need them to stay afloat in a fast-moving marketplace with ever-changing data uses and needs. Many companies can no longer afford to keep their heads in the sand about data lakes. Businesses need to understand both data warehouses and data lakes, and when and how to apply them.
Two Different Models: Data Warehouse vs. Data Lake
Gartner defines a data warehouse as, “A storage architecture designed to hold data extracted from transaction systems, operational data stores and external sources… suitable for enterprise-wide data analysis and reporting for predefined business needs.” Think of a data warehouse like a travel itinerary.
A family planning a summer trip (say, to Alaska) contacts places for lodging, restaurants, and attractions in advance. They write down where they are going and when they will be there for the entire trip. The person looking in on the house and feeding their pets has the itinerary in case of an emergency. Similarly, a data warehouse provides clearly defined communications, for a known aggregate set of data, to a well-defined user set. Businesses generate a known set of analyses and reports from the data warehouse.
In contrast, a data lake “is a collection of storage instances of various data assets additional to the originating data sources.” “A data lake presents an unrefined view of data to only the most highly skilled analysts.” Think of the data lake concept as a family going to Alaska that wants to be flexible. The family rents a car at the airport. Once in the car, the family members decide where to go as they drive along, adjusting the route on the fly according to what scenery looks interesting.
When the family needs a place to stay overnight, they try Hotwire to locate a hotel on the spot or stop by several places in town, even considering cabins and yurts. Depending on what is available (whether a lodge has any rooms) and what suggestions the locals may have (e.g., the gas station attendant or a person sitting outside a cafe), the family decides where to stay.
The house sitter may or may not be able to contact the family, but the family has more flexibility to go anywhere and to consider a wide variety of possibilities. A data lake operates similarly, with a broader and more distributed context, where some questions remain ambiguous, with an undefined set of users and a variety of different data presentations.
Similarities Between Data Warehouses and Data Lakes
While data warehouses and data lakes refer to different conceptual data strategies, both share common characteristics. As Kelle O’Neal, the Founder and CEO of First San Francisco Partners, mentions in the DATAVERSITY® Data Lake vs. Data Warehouse Webinar, implementing either Data Architecture does not mean the issues with data go away. Data warehouses and data lakes have many similarities. Both:
- Pertain to a data storage architecture
- Need a business purpose to exist and persist
- Drive business benefits
- Need some governance and oversight around the data
- Require some structure to understand what the data means
Contrasting Data Warehouses and Data Lakes
Data warehouses and data lakes complement each other as data-related strategies. As the table of key differences between a data warehouse and a data lake demonstrates, where the data warehouse approach falls short, the data lake fills in the gaps:
Photo Credit: First San Francisco Partners
Data warehouses rely on the assumption that the knowledge available about a schema at the time of construction will be sufficient to address a business problem. Business leaders and developers design relational databases around that schema. Information is written to the data warehouse according to this schema, allowing for structured reports.
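The schema-first idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SQLite stands in for a warehouse database, and the `policy_sales` table, its columns, and the sample rows are hypothetical, not taken from any system described here.

```python
import sqlite3

# Hypothetical warehouse table: the schema is fixed before any data is loaded.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE policy_sales (
        policy_id INTEGER PRIMARY KEY,
        agent     TEXT NOT NULL,
        premium   REAL NOT NULL
    )
    """
)

# Hypothetical sample rows that already match the schema.
rows = [(1, "Avery", 1200.0), (2, "Blake", 800.0), (3, "Avery", 500.0)]
conn.executemany("INSERT INTO policy_sales VALUES (?, ?, ?)", rows)

# A predefined, structured report: total premium per agent.
report = conn.execute(
    "SELECT agent, SUM(premium) FROM policy_sales GROUP BY agent ORDER BY agent"
).fetchall()
print(report)

# A record that does not fit the schema (missing agent) is rejected at write time.
try:
    conn.execute("INSERT INTO policy_sales VALUES (?, ?, ?)", (4, None, 300.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

The report is easy to produce because every row was forced into the same shape up front. The trade-off is that a question needing a column nobody planned for means changing the table itself.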
Should a new business requirement emerge that fundamentally changes the original data structure, remodeling the data warehouse can be incredibly time-consuming, taking from six to nine months. Even worse, missing a critical data attribute may lead to an early data warehouse death, where internal and external customers find it easier to gather and store the data themselves than to rely on the data warehouse. At this point, business leaders may be wishing for a more Agile structure.
A data lake strategy gives users easy access to raw data, the ability to consider multiple data attributes at once, and the flexibility to ask ambiguous, business-driven questions. Typically, companies have implemented Apache Hadoop, NoSQL, or similar technologies to set up the data lake as a schema-on-read architecture, in which structure is applied when the data is read rather than when it is written. But data lakes can end up as Data Swamps, where finding business value becomes like a quest for the Holy Grail. Data lakes need data scientists or analysts with considerable expertise to find the diamonds (useful information) in the rough (raw data).
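As a rough illustration of schema on read, the Python sketch below keeps raw records exactly as they arrived and applies structure only when a specific question is asked. The record fields (`patient_id`, `visit_cost`, and so on) and the `read_visit_costs` helper are hypothetical, and plain JSON strings stand in for files in a real Hadoop or NoSQL store.

```python
import json

# Hypothetical raw records, stored as-is with no schema enforced on write.
raw_lines = [
    '{"patient_id": 1, "visit_cost": 120.5, "clinic": "north"}',
    '{"patient_id": 2, "billing_code": "A12"}',
    '{"patient_id": 3, "visit_cost": "80", "notes": "follow-up"}',
]


def read_visit_costs(lines):
    """Apply a schema at read time: keep only records that carry a visit_cost."""
    costs = {}
    for line in lines:
        record = json.loads(line)
        if "visit_cost" not in record:
            continue  # this particular question does not apply to the record
        # Types are reconciled while reading, not while writing.
        costs[record["patient_id"]] = float(record["visit_cost"])
    return costs


costs = read_visit_costs(raw_lines)
print(costs)
```

Nothing was rejected when the data was stored, so a different question can read the same records with a different schema. The cost is that every reader has to handle missing fields and inconsistent types, which is part of why a data lake calls for skilled analysts.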
This can require enterprises to spend a lot of time and money to make a data lake worthwhile and not just a pile of data. Businesses then start to agree with Nick Heudecker, Research Director at Gartner, that meeting the needs of wider audiences requires curated repositories with governance, semantic consistency, and access controls, all elements already found in a data warehouse.
Data Warehouse vs. Data Lake Use Cases
When should a business leader move forward with a data warehouse or a data lake approach? The decision requires documenting business needs, analyzing characteristics, crafting versions of a best-fit architecture, and gathering data groupings that best give data insights. Data must be purpose driven. To give a starting place for such ideas, consider the case studies described below:
- Data Warehouse: For an organization in the very competitive insurance industry, agent and policyholder retention become important, especially where insurance brokers operate independently in a mature but evolving marketplace. As described by John Ladley, an insurance company recently decided to address its data needs through a data warehouse. Internal and external data sources were packaged through ETL (extract, transform, and load) into a data warehouse. Customers, who could easily have switched allegiances to other businesses, were retained due to service, in part because of consistent and easy reporting for sales and marketing, underwriting, and claims management.
- Data Lake: An organization whose Data Governance was directed by Shannon Fuller needed a data system to “support innovation and insights in health care service delivery.” The challenge was to create a value-based model from disparate informational sources, for a wide variety of users from clinical to billing services, quickly accessible in one place. Shannon concluded that a data lake concept, with a Data Architecture built on one common repository, enabled quicker delivery. Using a Hadoop implementation, data from various sources was filtered across read-only operational, curated, analytical sandbox, and persistent data layers, meeting the company’s needs.
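The ETL step mentioned in the data warehouse case study can be sketched as follows. This is a minimal Python illustration, not the insurer's actual pipeline: the source data, the `policies` table, and the `extract`, `transform`, and `load` functions are all hypothetical, and SQLite stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical sources: an internal system export and an external CSV feed.
internal_policies = [
    {"policy_id": "P-1", "holder": " alice ", "premium": "1200"},
    {"policy_id": "P-2", "holder": "BOB", "premium": "800.50"},
]
external_csv = "policy_id,holder,premium\nP-3,Carol,450\n"


def extract():
    """Gather rows from both sources into one list of dicts."""
    external = list(csv.DictReader(io.StringIO(external_csv)))
    return internal_policies + external


def transform(rows):
    """Normalize names and cast premiums so every row fits the target schema."""
    return [
        (r["policy_id"], r["holder"].strip().title(), float(r["premium"]))
        for r in rows
    ]


def load(conn, rows):
    """Write the cleaned rows into the (hypothetical) warehouse table."""
    conn.execute(
        "CREATE TABLE policies (policy_id TEXT PRIMARY KEY, holder TEXT, premium REAL)"
    )
    conn.executemany("INSERT INTO policies VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
loaded = conn.execute("SELECT * FROM policies ORDER BY policy_id").fetchall()
print(loaded)
```

The cleaning happens before the data lands, which is what makes the downstream reporting consistent. In a data lake, the same raw exports would be stored first and cleaned later, per question.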
As discussed, data warehouse and data lake architectures provide different approaches to data analysis and usage. Which one to use, and when, depends on some planning ahead of time. If a business purpose compares to a travel plan, maybe a combination of both strategies works best. In the example where a family takes a trip to Alaska, they may plan a structured itinerary for the first week through a sailboat-based kayaking tour of Prince William Sound. The second week, the family may rent a car in Anchorage and explore the Alaska Highway, taking in all the sights and sounds of Homer. Likewise, a company may use a combination of data warehouses and data lakes in reaching its business destinations and in effectively using data.
Image used under license from Shutterstock.com