A data lakehouse is a data storage architecture that holds both structured and unstructured data. Data warehouses store structured data and data lakes are designed for unstructured data; the data lakehouse was developed to provide a single resource for both types of data.
Data lakehouses combine the structure and management features of a data warehouse with the low-cost storage of a data lake. They let companies rely on one resource for all their data requirements, whether the project involves data science, business analytics, or machine learning.
Data lakehouses include features such as:
- Metadata layers: A metadata layer catalogs every object in the data lake and tracks which files belong to each table version, enabling features such as ACID transactions, data streaming, data indexing, schema enforcement and evolution, and data quality validation.
- Query engine designs: Data lakehouses run high-performance SQL execution engines, which allows multiple users to query the same data source at the same time.
- Optimized access: API tools associated with the data lakehouse have direct access to many data types and can be used to refine and analyze them. Data is stored in open file formats, making it directly accessible to machine learning and analytics tools.
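The schema enforcement a metadata layer performs can be sketched in plain Python. This is a minimal illustration, not a real lakehouse API: the `SCHEMA`, `validate`, and `append` names are hypothetical, and an actual metadata layer would also handle versioning, transactions, and file tracking.

```python
# Hypothetical sketch of metadata-layer schema enforcement:
# writes are validated against a declared schema before being committed,
# so malformed records never reach the table.
SCHEMA = {"id": int, "amount": float}

def validate(record, schema=SCHEMA):
    """Return True only if the record has exactly the declared fields
    and each value matches its declared type."""
    if set(record) != set(schema):
        return False
    return all(isinstance(record[key], typ) for key, typ in schema.items())

table = []  # stand-in for the table's committed data

def append(record):
    """Commit a record to the table, rejecting schema violations."""
    if not validate(record):
        raise ValueError(f"schema violation: {record!r}")
    table.append(record)

append({"id": 1, "amount": 9.99})        # accepted
# append({"id": "x", "amount": 1.0})     # would raise ValueError
```

In a real lakehouse the same check happens transactionally in the metadata layer (e.g., at commit time), so concurrent writers cannot corrupt a table's declared schema.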
Other Data Lakehouse Definitions Include:
- “An open data management architecture that combines the flexibility and cost-efficiency of data lakes with the data management and structure features of data warehouses, all on one data platform.” (Talend)
- “A new, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling business intelligence (BI) and machine learning (ML) on all data.” (Databricks)
- “A modern, open architecture that enables you to store, understand, and analyze your data. It combines the power and richness of data warehouses with the breadth and flexibility of the most popular open-source data technologies you use today.” (Oracle)
Use Cases Include:
- Experian improved data processing performance by 40% and reduced costs by 60% when it moved critical data into a data lakehouse on Oracle Cloud Infrastructure (OCI).
- Groupe France Mutuelle moved its core administration system to IBM Db2, cut operational and administrative costs by 80%, and reduced data analysis time by 30%.
- eMAG used Dremio to gain significant technical and business insight advantages, including automated data queries, data cataloging, and documented data lineage. Report generation time dropped by 50 to 75% (from weeks to hours), and the average time to analyze marketplace data fell from one day to 20 minutes.
Benefits Include:
- Allows users direct access to resources, improving self-reliance
- Accounts for Data Governance measures at the foundational stages
- Drastically reduces the likelihood of data redundancy
- Decreases time spent on streamlining data for storage
- Reduces the probability of the creation of data swamps
- Ensures data consistency and reliability to maintain data quality
- Supports a broader range of data formats, including real-time data
- Offers greater scalability for users through several clusters of storage