Data Lakes: What They Are and How to Use Them


Click to learn more about author Jaya Shankar Byrraju.

For most companies, having data means having access to wealth. And the key to fully leveraging the wealth that data represents lies in how effectively companies harness, manage, parse, and interpret it. But first, the data must exist somewhere.

Enter data lakes. These are central repositories of data kept in its natural, raw format after being pulled from a variety of sources, such as analytics systems, machine learning systems, dashboards, visualizations, social media, and mobile apps. Relational or non-relational, this data can come in as structured data from line-of-business applications or as unstructured data from IoT environments. In fact, it’s because so many new technology sources provide data in an unstructured way that data lakes came about in the first place; traditional big data solutions, namely data warehouses, were designed to store only structured data.

By enabling organizations to store and manage data in its original form, data lakes give data scientists, data architects, data analysts, and others the flexibility to analyze data and build optimized data architectures, even on the fly. With that, they allow organizations to harness more data from a variety of heterogeneous sources to create actionable insights, improve R&D decision-making, sift quickly through big data, make profitable predictions, and better control data movement, all while achieving business efficiencies and better customer engagement. Integrate a data lake with existing ERP or CRM solutions, for example, and you can analyze 360-degree profiles of individual customers and their behaviors to build mass-customization solutions.
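To make "store data in its original form" concrete, here is a minimal sketch of the schema-on-read idea, assuming raw events arrive as JSON lines. The sample records, their field names (`customer_id`, `amount`, `device`, `temp_c`), and the helper `read_with_schema` are hypothetical and chosen only for illustration; a real lake would read from a storage system rather than an in-memory list.

```python
import json

# Raw events kept exactly as they arrived; shapes differ per source (hypothetical data).
RAW_EVENTS = [
    '{"customer_id": "c1", "amount": "19.99", "source": "crm"}',
    '{"customer_id": "c2", "device": "sensor-7", "temp_c": 21.5, "source": "iot"}',
    '{"customer_id": "c1", "amount": 5, "source": "mobile"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: keep only records that contain every
    requested field, and cast each field to the requested type."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows


# One consumer's view of the raw data: purchases only.
purchases = read_with_schema(RAW_EVENTS, {"customer_id": str, "amount": float})

# A different consumer's view of the same raw data: sensor readings only.
readings = read_with_schema(RAW_EVENTS, {"device": str, "temp_c": float})

total_by_customer = {}
for row in purchases:
    key = row["customer_id"]
    total_by_customer[key] = total_by_customer.get(key, 0.0) + row["amount"]

print(total_by_customer)
print(readings)
```

The point of the sketch is that nothing is discarded or reshaped at write time; each analysis decides which fields it needs when it reads.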

The Mechanics Behind Data Lakes

As highly scalable “melting pots” of information points, data lakes can be housed on-premises, in the cloud, or as part of a hybrid solution, and can be established using multiple tools and frameworks. The following are some examples of those tools and some common brands associated with each task:

  • A highly scalable, distributed file system to manage huge volumes of data (e.g., Apache Hadoop Distributed File System or HDFS)
  • Highly scalable data storage systems to store and manage data (e.g., Amazon S3)
  • Real-time data streaming framework to efficiently move data between different systems (e.g., Apache Kafka)
  • Tools to run massive and parallel data queries (e.g., Apache Hive)
  • Tools to process and generate huge data sets (e.g., MapReduce)
  • Data lake RESTful API (e.g., Amazon API Gateway)
  • Tools for secure sign-in (e.g., Amazon Cognito)
  • Tools to run advanced and sophisticated analytics (e.g., Microsoft Machine Learning Server)
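As an illustration of the processing model behind the MapReduce entry in the list above, the following is a single-process sketch of the map, shuffle, and reduce steps applied to a word count. It is not Hadoop code; the function names are hypothetical, and a real framework would distribute each phase across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run the three phases in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))


counts = word_count(["raw data lands first", "raw data is queried later"])
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can in principle run in parallel, which is what makes the pattern suit very large data sets.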

It’s important to note that as companies use multiple software systems to collect customer data, data volumes grow and quality often suffers. Poor-quality data, and the expensive Data Management solutions needed to handle it, pose a big challenge for companies. The market has responded with a number of analytics solutions and machine learning algorithms to help companies address these data-related challenges.

Offering horizontal scalability and a distributed file system, the open-source Apache Hadoop framework is among the most widely used analytics solutions for big data. Apache HBase can be used to host very large tables on top of HDFS. Apache Hive is another tool that works on top of Hadoop for query and analysis. Apache Spark provides a cluster-computing framework, and Apache Kafka handles real-time streaming of data. Presto, another open-source tool, is a distributed SQL query engine built for fast, interactive queries. Microsoft Machine Learning Server is an enterprise-level platform for data analytics at scale. Amazon S3 can serve as a cost-effective data storage option. Microsoft’s Azure HDInsight is a data lake analytics platform that enables businesses to apply popular analytics tools and frameworks to data lakes using pre-configured clusters. Both Azure and AWS offer end-to-end tools to manage data lakes.

Leveraging Data

The key to businesses’ ability to fully leverage data lies in how well they manipulate and interpret the vast wealth of information and, specifically, how quickly they can move data into data lakes and then extract insights from it. To do this, a proper data lake architecture must be implemented. Following are five key components of a data lake architecture:

1. Data Ingestion: A highly scalable ingestion layer is required to extract data from various sources, such as websites, mobile apps, social media, IoT devices, and existing Data Management systems. It should be flexible enough to run in batch, one-time, or real-time modes, and it should support all types of data as well as new data sources.

2. Data Storage: A highly scalable data storage system should be able to store and process raw data and support encryption and compression while remaining cost-effective.

3. Data Security: Regardless of the type of data processed, data lakes should be tightly secured through multi-factor authentication, authorization, role-based access, data protection, and similar controls.

4. Data Analytics: After data is ingested, it should be quickly and efficiently analyzed using data analytics and machine learning tools to derive valuable insights and move vetted data into a data warehouse.

5. Data Governance: The entire process of data ingestion, preparation, cataloging, integration, and query acceleration should be streamlined to produce enterprise-level Data Quality. It is also important to track the changes to key data elements for a data audit.
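Three of the components above (ingestion, storage, and governance) can be sketched together in a few lines. The example below is only an illustration under stated assumptions: a local temporary directory stands in for the lake's storage system, the `raw/<source>/<date>/` layout and the `ingest` and `read_batch` helpers are hypothetical, and security and analytics are left out entirely.

```python
import gzip
import json
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path


def ingest(lake_root, source, records, audit_log):
    """Write one batch of raw records, unmodified, as gzip-compressed JSON lines
    under raw/<source>/<YYYY-MM-DD>/, and note the write in an audit log."""
    now = datetime.now(timezone.utc)
    partition = Path(lake_root) / "raw" / source / now.strftime("%Y-%m-%d")
    partition.mkdir(parents=True, exist_ok=True)
    # A random suffix keeps batch file names unique within a partition.
    target = partition / f"batch-{uuid.uuid4().hex}.jsonl.gz"
    with gzip.open(target, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    audit_log.append({
        "source": source,
        "path": str(target),
        "records": len(records),
        "ingested_at": now.isoformat(),
    })
    return target


def read_batch(path):
    """Read a compressed batch back as a list of dictionaries."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]


audit_log = []
lake_root = tempfile.mkdtemp()  # stand-in for real lake storage
batch = [{"event": "login", "user": "u1"}, {"event": "purchase", "user": "u1"}]
path = ingest(lake_root, "mobile_app", batch, audit_log)

print(read_batch(path))
print(audit_log[0]["records"])
```

Partitioning by source and date is one common way to keep raw data easy to locate later, and the audit entries give a simple trail of what was loaded and when.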

After a data lake architecture is established, it’s time to integrate the architecture with existing infrastructure, which can require a sizable upfront investment in human resources. During the design and implementation phase of a data lake architecture, a company may well need the services of solution architects, data engineers, user-interface developers, ML engineers, support engineers, technical leads, project managers, quality assurance professionals, and possibly others.

Once the integration of the data lake system is successful, some 40 percent of those resources are likely to remain on the team. In all, it is estimated that, depending on the tools and resources incorporated into the architecture, data lake onboarding can cost an organization anywhere between $200,000 and $1 million.

What the Investment Yields

The cost of data lake implementation, however, is relative. A survey conducted by the Aberdeen Group reports that organizations that have successfully leveraged a data lake architecture have seen organic, year-over-year growth at a rate 9 percent higher than their competitors. It shows that investments in data lakes are providing returns in the form of increased operational efficiency; lower transactional costs; cleaner, faster, and more relevant information; and “game-changing” business insights. To name a few examples:

  • Data lakes can give retailers profitable insights from raw data, such as log files, streaming audio and video, text files, and social media content, among other sources, to quickly identify real-time consumer behavior and convert actions into sales. Such 360-degree profile views allow stores to better interact with customers and push on-the-spot, customized offers to retain business or acquire new sales.
  • Data lakes can help companies improve their R&D performance by allowing researchers to make more informed decisions regarding the wealth of highly complex data assets that feed advanced predictive and prescriptive analytics.
  • Companies can use data lakes to centralize disparate data generated from a variety of sources and run analytics and ML algorithms to be the first to identify business opportunities. For instance, a biotechnology company can implement a data lake that receives manufacturing data, research data, customer support data, and public data sets and provides real-time visibility into the research process for various user communities via different user interfaces.

The increasing need to quickly extract insights from data to stay ahead of the competition while making the business future-proof is driving organizations toward data lake innovation. MarketsandMarkets reports that the global data lake market was valued at $7.9 billion in 2019 and is expected to grow at a compound annual growth rate (CAGR) of 20.6 percent through 2024 to reach $20.1 billion. Figures from Mordor Intelligence put the 2019 global data lake market at $3.74 billion and project that it will reach $17.60 billion by 2025, showing a 29.9 percent CAGR. Regardless of the precise figures, both calculations speak volumes about data lakes’ ability to help businesses optimize resources, reduce operational costs, and increase productivity. Are you taking advantage of this specialized technology?
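For readers who want to check the growth figures quoted above, the standard CAGR formula can be applied directly. This is a small sketch that assumes each forecast runs from 2019 to its stated end year; the research firms may use a different base year or period, so small differences from the published rates are to be expected.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end_value / start_value) ** (1 / years) - 1


# Market sizes quoted in the article, in billions of US dollars.
markets_and_markets = cagr(7.9, 20.1, 2024 - 2019)
mordor_intelligence = cagr(3.74, 17.60, 2025 - 2019)

print(f"MarketsandMarkets: {markets_and_markets:.1%}")
print(f"Mordor Intelligence: {mordor_intelligence:.1%}")
```

Under these assumptions the results come out at roughly 20.5 percent and 29.5 percent, close to the published 20.6 and 29.9 percent.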
