WANT TO BE A CERTIFIED DATA MANAGEMENT PROFESSIONAL?
Our online training program in the DMBOK and CDMP exam preparation provides a solid foundation of different data disciplines.
Click to learn more about author Gilad David Maayan.
A Data Warehouse is a central repository of integrated historical data derived from operational systems and external data sources. As a central component of Business Intelligence, a Data Warehouse enables enterprises to support a wide range of business decisions, including product pricing, business expansion, and investment in new production methods.
Aside from its role in facilitating analysis and reporting, a Data Warehouse provides the following uses for enterprises:
- Keeping data analysis separate from production systems. Operational databases used daily by enterprises are not equipped to run complex analytical queries. The Data Warehouse allows enterprises to run such queries without affecting production systems.
- Bringing consistency to disparate data sources.
- Optimized design for analytical queries.
The emergence of Cloud Computing over the last five years has significantly impacted Data Warehouse architecture, leading to the increasing popularity of Data Warehouses-as-a-service (DWaaS). In the rest of this article, you’ll find out the differences between traditional Data Warehouses and Cloud-based Data Warehouses.
Traditional Data Warehouse
The traditional Data Warehouse requires the provisioning of on-premise IT resources such as servers and software to deliver Data Warehouse functions. Enterprises running their own on-premise Data Warehouses must effectively manage infrastructure too.
Traditional Data Warehouses are divided into a three-tier structure as follows:
- The bottom tier contains the Data Warehouse server, with data pulled from many different sources integrated into a single repository.
- The middle tier contains OLAP servers, which make data more accessible for the types of queries that will be used on it.
- The top tier houses the front-end BI tools used for querying, reporting, and analytics.
To pull data into the unified repository, ETL (Extract, Transform, and Load) tools are typically used to take the data from various sources, blend it and apply business rules to get it into the correct structure for querying, and finally load the data into the Data Warehouse.
Two contrasting approaches to traditional Data Warehouse design reflect the differing opinions of two computer science pioneers, Bill Inmon and Ralph Kimball.
- Bill Inmon’s top-down approach suggests that the Data Warehouse is the centralized repository for all enterprise data. Dimensional data marts, serving particular lines of business (e.g. finance) are created from the Data Warehouse.
- Ralph Kimball’s bottom-up approach posits that the Data Warehouse emerges as a result of combining different data marts.
Cloud Data Warehouse
The Cloud-based Data Warehouse approach leverages Data Warehouse services offered by public Cloud providers such as Amazon Redshift or Google BigQuery.
By offering Data Warehouse functionalities which are accessible over the Internet, public Cloud providers enable companies to eschew the initial setup costs needed to build a traditional on-premise Data Warehouse. Furthermore, these enterprise Data Warehouses in the Cloud are fully managed, so the service provider manages and assumes responsibility for providing the required Data Warehouse functionality, including patches and updates to the system.
Cloud Architectures are somewhat different from traditional Data Warehouse approaches. In Redshift, for example, the service operates by requiring you to provision a cluster of Cloud-based computing nodes, some of which compile queries while some execute those queries. Google offers a serverless service, meaning Google dynamically manages the allocation of machine resources, abstracting such decisions away from users.
A level of Data Warehouse optimization is achieved in the Cloud that is tough to match with the limited power of an on-premise setup. Columnar storage, where tables values are stored by column rather than row, caters for much faster aggregate queries, in line with the type of queries you need to run in a Data Warehouse. Massively parallel processing is also an important feature that dramatically improves query speeds by coordinating query processing for large datasets using many machines.
In terms of scalability, in the Cloud, it’s as easy as provisioning more resources from the Cloud provider. However, on-premise scalability is time-consuming and costly, necessitating the purchase of more hardware.
Security is a tricky issue in the Cloud—sending terabytes of data over the Internet brings serious security concerns, and perhaps some compliance concerns too for sensitive data. On-premise setups avoid such concerns because the enterprise controls everything.
The low barriers to entry in the Cloud help make Data Warehousing more accessible for small and medium-sized companies. Furthermore, even the largest enterprises can benefit from lower costs, such as ongoing management of the infrastructure and effortless scalability.
The Cloud is not without its issues, such as potential security concerns, however, the benefits outweigh the negatives. Legacy on-premise setups aren’t entirely obsolete, but data volume and velocity keeps growing, and Cloud-based services are designed to handle this. Furthermore, as more workloads move to the Cloud and more companies enter the market as service providers, it appears that the future of Data Warehousing lies in the Cloud.
For more information on Data Warehouse basics, check out this Data Warehouse guide.
Image Credit: Agile SEO