Optimizing the Data Warehouse

By on

The data warehouse, a relational database technology, makes all enterprise information actionable, and will continue to be prominent as a Data Architecture component. In the 2000s, a typical business would consolidate data from multiple relational databases, centralizing all this information through a data warehouse, and consequently streamlining business tasks. However, the business context has shifted and continues to change from static and centralized to real-time and voluminous.

Around the 2010s, data warehouses faced significant challenges in keeping up with data amounts and varieties, which were generated much faster than before. Meanwhile, business-decision timeframes grew smaller. For example, the internet of things (IoT) alone “will create about 847 Zettabytes (ZB) per year by 2021, up from 218 ZB per year in 2016.” 

Many IoT devices, like the Nest thermostat, need to adapt and adjust well at the click of a button. A data warehouse, alone, works too slowly and is too complex to handle this reality, calling into question the data warehouse’s role as a primary repository. However, a data warehouse, due to its integrity, availability, familiarity, and consistency, remains required. First, a growing number of regulations in the 2020s, such as the General Data Protection Regulation (GDPR) and the California Consumer Protection Act (CCPA), have required data sets to be traceable and reproducible at any given time frame.

Also, business analysts want to query, filter, and group data sets without needing specialized technical skillsets. Finally, relational Database Management Systems (RDBMS), make up 80 percent of the total operational database marketplace and 70 percent of new applications. These RDBMS will require data warehouses to transform this information and make it business-ready. Data warehouses are here to stay.

To adapt, the data warehouse will become an integral, modern component of Enterprise Data Architecture by:

  • Utilizing its ability to provide context for information.
  • Integrating with data lake architectures, which are excellent at accessing vast amounts of raw data.
  • Leveraging cloud computing, remote computation, data access, and storage services.
  • Demonstrating sound Data Governance practices and processes.
  • Using AI and machine learning (ML) for routine data warehouse management tasks.

Providing Information Context — A Data Warehouse Strength

Data warehouse architecture, whether legacy or modern, stands apart in providing context around data, and thus, information. A data warehouse organizes data from multiple relational sources around a schema or a subject area. Then, the data warehouse uses extract, transform, and load processes.

Image Credit: DAMA International

Since a data warehouse transforms information around a fixed model, representing specific business purpose(s), the resulting analysis and reports remain consistent and durable. The information organizational strength of a data warehouse provides accurate historical information and beneficial insights.

Integrating the Data Warehouse with the Data Lake and the Data Lakehouse

Despite its strength, the data warehouse, alone, fails as a single source of truth for all enterprise data and benefits integrating with a data lake. A data lake has non-relational technology, allowing it to ingest large amounts of structured and unstructured data without breaking or slowing down, and to do this in real-time.

Furthermore, the data lake obtains data sets more easily in their pure forms, however, it requires advanced technical skill to retrieve meaningful information from them. Combining both the data warehouse and data lake capabilities has become ideal in leveraging their advantages.

Architecturally, data warehouses work independently within a data lake or combine with a data lake to form a data lakehouse. In the first case, for example, Yoo and Yoo put an open-source data warehouse (Hive) on top of a data lake framework (Hadoop), to process and analyze supply chain management information from Radio Frequency Identification (RFID) tags. An algorithmic connector allowed Hive and Hadoop to communicate well with each other.

Mixing the data warehouse and data lake architectures creates an emerging platform — a data lakehouse.

Roughly, a data lakehouse reshapes raw data into objects that can be consumed by business intelligence applications as if they were coming from a data warehouse. Uniting a data warehouse and a data lake through data translation or a different architectural structure enhances the powers of each.

Leveraging Cloud Computing

Put a data warehouse in the cloud (a “cloud data warehouse”) and get advantages of a data warehouse and lake at a grander scale. Cloud computing, as compared to on-premise, has more space and can scale up faster, handling various structured and semi-structured data types. Third-party vendors make these capabilities possible by providing the Software as a Service (SaaS) needed for storage, computing, and retrieval processes. Think of a cloud data warehouse as a warehouse within a lake or a data lakehouse having a more significant, better-performing infrastructure.

Why not use the cloud data warehouse as a centralized repository? Cloud data warehousing, although expanding the power of a data warehouse, may not be suitable for all business tasks. Some enterprise data does not need to be stored, computed, or retrieved. At the same time, the “meter is always running.” This is especially the case for infrequent projects, such as inventory checks or audits. Enter hybrid computing, where cloud and on-premise data warehouses and other data sources work together to make data ready and usable.

The Data Warehouse as an Enterprise Data Architecture Component

Data lake and cloud computing architecture make the lone, legacy data warehouse less viable long-term as an enterprise data hub. Instead, a data warehouse needs to maintain its specialty, organizing structured data while exchanging information with other enterprise data systems.

This reality is in line with a 2019 Percona Live report, showing a shift to and desire for multi-database systems or MDDBMS, containing from 25 to 10,000 database instances, including data warehouses. Technical advances in data transformation and data virtualization make MDDBMS even more attractive. While the backend data architecture may be involved, a middle layer simplifies it. The analyst or end-user gets the information all in one place, through data virtualization. Freed from the expectation of being one database system ruling all of the others, the database warehouse can specialize, and capitalize on its strengths.

Demonstrating Sound Data Governance Is Critical

The trend towards MDDBMS means that any data warehouse has to demonstrate sound Data Governance. Data Architectures have become a little more complicated. A review of literature and policies regarding data access in clinical data warehouses, finds and emphasizes the need for formal management of data access (Data Governance).

As data warehouses need to integrate and choose which information to share with other data systems, a governance structure needs to inform how data warehouses should proceed. Employees and enterprise technological assets need to coordinate, through Data Governance inputting, transforming, deleting, and using data — keeping good quality and value in the data warehouse.

Using AI and Machine Learning

Using AI and machine learning to save time and money managing and maintaining the data warehouse makes sense. Gartner predicts that through the end of 2022, Data Management manual tasks, including some database management ones, will be reduced by 45 percent through augmented Data Management. Increased Data Management with a durable machine learning aspect makes the data warehouse more self-managing, self-securing, and self-repairing — better able to function within the enterprise data architecture and better positioned for the future. By augmenting the data warehouse, more bandwidth becomes available to add and manage another data source servicing a different set of business needs.

As automation and AI and machine learning (ML) clean, backup, synchronize, and secure the data warehouse, it becomes more capable. It can handle training data sets needed to make AI and machine learning operate well to augment analysis.

Business then benefit as AI and ML escalate data patterns and trends not noticed by human eyes. Having a well-functioning data warehouse that can utilize and support AI and Machine Learning means discovering business efficiencies sooner. Copper-mining giant Freeport-McMoRan reaped the benefits of this.

The data warehouse depends on an excellent Data Governance program. Without this, data quality becomes suspect, and the data warehouse backlogged in trying to integrate with other data systems and use its specialization to the fullest. Technical advancements, such as data lake architectures, cloud computing, virtualization, AI, and ML, mitigate data warehouse weaknesses and allow it to perform better in organizing information contexts for business analysis. But these only work as far as a formalized Data Governance structure oversees the entire Enterprise Data Architecture, including the data warehouse’s role. From Data Governance, the data warehouse transforms enterprise data into business knowledge that is ready and actionable.

Image used under license from

Leave a Reply