Bill Inmon, the “Father of Data Warehousing,” defines a Data Warehouse (DW) as, “a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.” In his white paper, Modern Data Architecture, Inmon adds that the Data Warehouse represents “conventional wisdom” and is now a standard part of the corporate infrastructure.
Matthew Mayo, Data Scientist at KDnuggets, in an article entitled, Big Data Key Terms Explained, quotes Data Mining textbook authors Han, Kamber, & Pei, who define a Data Warehouse as a data storage architecture which allows “business executives to systematically organize, understand, and use their data to make strategic decisions.” Certainly, the Data Warehouse is a known architecture in many modern enterprises.
The Data Warehouse has been employed successfully across many different enterprise use cases for years, though Data Warehouses have also transformed, and must continue to if they want to keep up with the changing requirements of contemporary Enterprise Data Management.
Bin Jiang in Is Inmon’s Data Warehouse Definition Still Accurate? reinterprets Inmon’s Data Warehouse definition, calling it, “An infrastructure-based on the information technology for an organization to integrate, collect, and prepare data on a regular basis for easing analysis.”
Oracle’s Data Warehousing Guide defines a Data Warehouse as a relational database:
“Designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.”
Oracle’s Data Warehousing Guide expands upon Inmon’s four characteristics in a number of ways:
- Subject-Oriented: The Data Warehouse is designed to help you analyze data. For example, to learn more about your company’s sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like “Who was our best customer for this item last year?” This ability to define a Data Warehouse by subject matter, sales in this case, makes the Data Warehouse subject-oriented.
- Integrated: Integration is closely related to subject orientation. Data Warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.
- Non-volatile: Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.
- Time-Variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A Data Warehouse’s focus on change over time is what is meant by the term time-variant.
Data Warehouse Structures
Oracle breaks down Data Warehouse architectures into three simplified structures: basic, basic with a staging area, and basic with a staging area and data marts. With a basic structure, operational systems and flat files provide raw data and data are stored, along with metadata and summary data, where end users can access it for analysis, reporting and mining. Adding a staging area, which sits between the data sources and the warehouse, provides a separate place for data to be cleaned before entering the warehouse. Oracle says it’s possible to “customize your warehouse’s architecture for different groups within your organization. You can do this by adding data marts, which are systems designed for a particular line of business.” It is possible to have separate data marts within the warehouse for sales, inventory and purchasing, for example, and end users can access data from one or all department data marts.
How are Data Warehouses Built?
Wayne Eckerson, Principal Consultant at Eckerson Group, in an article entitled Four Ways to Build a Data Warehouse, compares and contrasts the most commonly used approaches to creating a Data Warehouse.
“Data Warehousing managers need to be aware of these methodologies but not wedded to them,” he says. “These methodologies have shaped the debate about Data Warehousing best practices, and comprise the building blocks for methodologies developed by practicing consultants.”
The four primary approaches to Data Warehousing as discussed by Eckerson are:
Major Characteristics of Top-Down Approach
- Emphasizes the Data Warehouse.
- Starts by designing an enterprise model for a DW.
- Deploys multi-tier architecture comprised of a staging area, a DW, and “dependent” data marts.
- Staging area is persistent.
- DW is enterprise-oriented; data marts are function-specific.
- DW has atomic-level data; data marts have summary data.
- DW uses an enterprise-based normalized model; data marts use a subject-specific dimensional model.
- Users can query the Data Warehouse and data marts.
Major Characteristics of Bottom-Up Approach
- Emphasizes data marts.
- Starts by designing a dimensional model for a data mart.
- Uses a “flat” architecture consisting of a staging area and data marts.
- Staging area is largely non-persistent.
- Data marts contain both atomic and summary data.
- Data marts can provide both enterprise and function-specific views.
- Data mart consists of a single star schema, logically or physically deployed.
- Data marts are deployed incrementally and “integrated” using conformed dimensions.
Major Characteristics of Hybrid Approach
- Emphasizes Data Warehouse and data marts; blends “top-down” and “bottom-up” methods.
- Starts by designing enterprise and local models synchronously.
- Spends 2–3 weeks creating a high-level, normalized, enterprise model; fleshes out model with initial marts.
- Populates marts with atomic and summary data via a non-persistent staging area.
- Models marts as one or more star schemas.
- Uses ETL tool to populate data marts and exchange Metadata between ETL tool and data marts.
- Backfills a Data Warehouse behind the marts when users want views at atomic level across marts; instantiates the “fleshed out” enterprise model, and moves atomic data to the DW
Major Characteristics of Federated Approach
- Emphasizes the need to integrate new and existing heterogeneous BI environments.
- An architecture of architectures.
- Acknowledges the reality of change in organizations and systems that make it difficult to implement a formalized architecture.
- Rationalizes the use of whatever means possible to implement or integrate analytical resources to meet changing needs or business conditions.
- Encourages organizations to share dimensions, facts, rules, definitions, and data wherever possible, however possible.
By understanding these different approaches, Eckerson says, organizations can create a methodology that meets their unique needs, based on a foundation of best practice models.
Data Warehouses: Variations on a Theme
In Defining Data Warehouse Variants by Classification, Bin Jiang organizes the Data Warehouse based on four variations and eight classes.
A Topological, or Back-End variation includes classes based on characteristics of the data source side. He classifies a Data Warehouse as “single-source” if it has only one source application and “multi-source” if it is not single-source.
The second variation is based on its Organizational or Front-End classification, says Jiang. Data Warehouses that are dedicated to one part of the organization are considered “Departmental Data Warehouses,” and those employed by the whole organization are classified as “Enterprise Data Warehouses.”
A third variant is based on Temporality or Freshness. If the content is updated at intervals, for example, daily or weekly, Jiang classifies it as a ”Periodical Data Warehouse.” If it’s updated very shortly after it’s generated or changed, he classifies it as a “Real-time Data Warehouse.”
His fourth classification is Geographical, or Location-based. Data Warehouses are classified as “distributed” if the major data objects of the warehouse are stored and processed at different geographical locations, and “centralized” if all data objects are kept in the same location.
Evolution of the Data Warehouse
Historically, Data Warehouses have evolved using structured repetitive data that has been filtered or distilled before entering the Data Warehouse. In recent years, says Inmon, the Data Warehouse has evolved due to the use of contextual information that can be attached to unstructured data, allowing it to be stored in the warehouse as well. Says Inmon,
“Previously structured relational data could not be analytically mixed and matched with unstructured textual data. But with the advent of contextualization, these types of analysis can be done and are natural and easy to do.”
In the Data Warehouse, non-repetitive data, such as survey comments, emails, and conversations, are treated differently than repetitive occurrences of data, like that from click stream, metering, or machine or analog processing, Inmon says. “Non-repetitive data is textual-based data that was generated by the written or the spoken word,” read and reformatted and – more importantly – now able to be contextualized. In order to make any sense out of the non-repetitive data for use in the Data Warehouse, it must have the context of the data established.
Inman goes on to say that,
“In many cases, the context of the non-repetitive data is more important than the data itself. In any case, non-repetitive data cannot be used for decision making until the context has been established.”
Data Lakes and Data Warehouses: Mutually Exclusive or Perfect Partners?
Data Lakes have emerged onto the Data Management landscape in recent years, yet a Data Lake is not necessarily a replacement for the Data Warehouse, says Nick Heudecker, Research Director at Gartner, in Data Lakes: Don’t Confuse Them With Data Warehouses, Warns Gartner: “Data Lakes aren’t a replacement for existing analytical platforms or infrastructure. Instead, they complement existing efforts and support the discovery of new questions.” Once those questions are discovered, he says, you then “optimize” for the answers. “Optimizing may mean moving out of the lake and into data marts or Data Warehouses.”
- Data: A Data Warehouse only stores data that has been modeled/structured, while a Data Lake is no respecter of data. It stores it all—structured, semi-structured, and unstructured.
- Processing: Before a business can load data into a Data Warehouse, it first needs to give it some shape and structure—i.e., the data needs to be modeled. That’s called schema-on-write. With a Data Lake, you just load in the raw data, as-is, and then when you’re ready to use the data, that’s when you give it shape and structure. That’s called schema-on-read. Two very different approaches.
- Storage: One of the primary features of Big Data technologies like Hadoop is that the cost of storing data is relatively low as compared to the Data Warehouse. There are two key reasons for this: First, Hadoop is open source software, so the licensing and community support is free. And second, Hadoop is designed to be installed on low-cost commodity hardware.
- Agility: A Data Warehouse is a highly-structured repository, by definition. It’s not technically hard to change the structure, but it can be very time-consuming given all the business processes that are tied to it. A data lake, on the other hand, lacks the structure of a Data Warehouse—which gives developers and Data Scientists the ability to easily configure and reconfigure their models, queries, and apps on-the-fly.
- Security: Data Warehouse technologies have been around for decades, while Big Data technologies (the underpinnings of a Data Lake) are relatively new. Thus, the ability to secure data in a Data Warehouse is much more mature than securing data in a data lake. It should be noted, however, that there’s a significant effort being placed on security right now in the Big Data industry. It’s not a question of if, but when.
- Users: For a long time, the rally cry has been BI and Analytics for everyone. The Data Warehouse was built and invited “everyone” to come, but have they come? On average, 20-25% of them have. Is it the same cry for the Data Lake? Will the Data Lake invite everyone to come? Not if you’re smart. A Data Lake, at this point in its maturity, is best suited for the Data Scientists, says Tamara Dull.
The Data Warehouse Continues to Evolve
Bill Inmon sees great potential in the evolution of the Data Warehouse and it use moving forward. He says:
“Classic analytical processing of transaction-based data is done in the Data Warehouse as it has always been done. Nothing has changed there. But now analytics on contextualized data can be done, and that form of analytics is new and novel. Most organizations have not been able to base decision-making on unstructured textual data before. And there is a new form of analytics that is possible in the Data Warehouse, which is the possibility of blended analytics. Blended analytics is analytics done using a blend of structured transactional data and unstructured contextualized data.”
He adds, “. . . There are many other forms of analytics that are possible as well.” Such forms include Predictive and Prescriptive Analytics, as well as various Machine Learning technologies and others that are changing the way data is managed and analyzed. The Data Warehouse has long been a staple of enterprise Data Architectures, and according to experts like Inmon the Data Warehouse has a strong future in the new world of Big Data and Advanced Analytics as well.
Data Warehouses, just like other traditional Data Management tools, aren’t going anywhere; their importance will remain key to effective Data Management for many years to come.