The Data Lake concept is intriguing, particularly given the ascendancy of Big Data initiatives and the value they add to an organization’s data assets by integrating proprietary historical and/or transactional data.
The benefits derived from utilizing a highly scalable, distributed data store as a single repository (aka a data lake) for all one’s data—in its native format—include:
- Silo Elimination: In theory, independent data marts are no longer necessary as Data Lakes enable the enterprise to distance itself from a silo-based culture while emphasizing sharing and integration.
- Cost Reductions: The cost reductions associated with deploying Data Lakes include decreased ingestion costs, since data need not be transformed out of its native format, as well as lower license and server fees. Open-source alternatives, such as deploying Hadoop as an integration hub, also help to reduce storage costs.
- Expedience: Instead of first discerning data’s relevant attributes and storing it according to those characteristics, data can be stored in a Data Lake immediately in its raw form, keeping pace with the velocity of Big Data.
Nonetheless, these upfront advantages give rise to a number of complications over time pertaining to performance, user accessibility, and Big Data Governance. Without appropriate governance measures, Data Lakes can create a ‘data free-for-all’ that exacerbates issues of data quality and data lineage and complicates the handling of structured, semi-structured, and unstructured data, work typically assigned to Data Scientists.
The most critical way in which Data Lakes can undermine Big Data Governance is by forgoing the discipline of Data Science, effectively transferring the responsibilities of these skilled statisticians to untrained end users. In addition to creating sophisticated analytical models to solve specific business problems, Data Scientists discern the attributes of semi-structured and unstructured data to determine their use for the enterprise.
Part of the purported appeal of Data Lakes is that they deliver access to all data throughout the enterprise. Thus, even when Data Scientists are employed to work on data in a lake, other users can tamper with and manipulate data for their own purposes—obfuscating data quality and data lineage. Even worse, when Data Scientists aren’t employed, end users are required to do their jobs. Doing so effectively could necessitate considerable training and additional funding.
Data Lakes present a number of complications for Metadata, particularly for semi-structured and unstructured data whose initial attributes have yet to be determined. To ascertain meaning from data, it is essential that data adhere to Metadata conventions formally defined by governance principles. Additionally, uniform Metadata standards enable users to understand how data relates to other data, such as how proprietary CRM data relates to sentiment data. The danger with Data Lakes is that individual end users are liable to ascribe only those attributes they need within the context of their particular business problem, attributes which may not follow governance conventions. As a result, routine governance tasks such as updates, replication, and de-duplication may become considerably more difficult in Data Lakes.
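To make this concrete, the kind of Metadata convention described above can be enforced with a simple governance check at ingestion time, so that untagged data never enters the lake. The following Python sketch is purely illustrative; the required field names are invented assumptions rather than any standard.

```python
# Hypothetical governance check at the point of ingestion: every dataset
# landing in the lake must carry a uniform set of metadata attributes.
# The field names below are illustrative assumptions, not a standard.
REQUIRED_FIELDS = {"dataset_name", "owner", "source_system",
                   "ingested_at", "schema_version"}

def validate_metadata(metadata):
    """Return a list of governance violations for a dataset's metadata."""
    return [f"missing field: {f}"
            for f in sorted(REQUIRED_FIELDS - metadata.keys())]

def ingest(records, metadata):
    """Reject any load whose metadata does not satisfy the convention."""
    violations = validate_metadata(metadata)
    if violations:
        # Refuse the load rather than let untagged data enter the lake.
        raise ValueError("governance check failed: " + "; ".join(violations))
    return {"rows": len(records), "metadata": metadata}
```

In practice such a check might be a hook in the ingestion pipeline; the point is simply that the convention is enforced once, centrally, rather than left to each end user.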
According to Gartner: “…the lake itself does not provide native support for adding structure or meaning…In the data lake concept, this is by design. Technology could be applied or added to the lake to do this, but without at least some semblance of information governance, the lake will end up being a collection of disconnected data pools, or information silos, in one place.”
The concern for semantics in Data Lakes is similar to the concern for Metadata; without a dedicated individual to initially parse through the data to provision semantic consistency, user autonomy can result in disparate semantics. Basic governance principles such as glossary-based definitions become nearly impossible to achieve without consistent semantics. Instead, semantics will likely vary by user or department and confound governance outcomes of regulated, well-groomed data.
Moreover, there is a distinct possibility that certain users will not have the requisite knowledge and skills to successfully manipulate semantics, or even to implement them in data, which greatly reduces the value of semantics applied to semi-structured and unstructured data. That value is predicated on the fact that semantics reduces all data to triples, in which each statement is described with a subject, a predicate, and an object, making them easy to compare, integrate, and aggregate with structured data.
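The triple model underlying that claim can be illustrated in a few lines: because every fact, whatever its source, reduces to the same (subject, predicate, object) shape, facts from CRM and sentiment data can be aggregated directly. The entities and predicates below are invented examples, not a real vocabulary.

```python
# Each fact is a (subject, predicate, object) statement, so records from
# different sources reduce to one comparable shape. All names here are
# hypothetical examples for illustration.
crm_triples = [
    ("customer:42", "hasName", "Acme Corp"),
    ("customer:42", "hasSegment", "enterprise"),
]

sentiment_triples = [
    ("customer:42", "expressedSentiment", "negative"),
]

def facts_about(subject, *sources):
    """Aggregate every triple about one subject across data sets."""
    return [t for triples in sources for t in triples if t[0] == subject]
```

Without consistent semantics, of course, one department’s `hasSegment` and another’s `customerTier` never line up, which is exactly the governance failure the passage describes.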
The negative ramifications of collecting all data in a single repository without consistent definitions, Metadata, and semantics are that organizations are subject to regulatory issues and security concerns. With enterprise-wide access to Data Lakes (which is regularly touted as one of the boons of this approach), organizations effectively lose one of the most fundamental security measures related to data—regulating access to data by data type, use, department and a host of other factors relevant to an organization’s business and use cases. In highly regulated industries such as finance and health care, Data Lakes could pose significant risks.
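The access-control measure that a single shared lake tends to forfeit can be sketched as a simple policy lookup keyed by department and data type. The policy table below is an invented example, not a description of any real system.

```python
# Hypothetical policy: which actions each department may perform on each
# data type. A Data Lake with blanket enterprise-wide access effectively
# collapses this table to "everyone may do everything".
POLICY = {
    ("finance", "transactions"): {"read", "write"},
    ("marketing", "sentiment"): {"read"},
}

def can_access(department, data_type, action):
    """Check an action against the department/data-type policy."""
    return action in POLICY.get((department, data_type), set())
```

In regulated industries, it is precisely this kind of fine-grained gate that auditors expect to see, which is why its absence makes Data Lakes risky there.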
Much of the appeal of Data Lakes lies with administration and IT departments, which view them as a way of reducing architectural complexity and simplifying the process of ingesting and storing data. Gartner observed that, “Data lake hype is ramping up significantly and IT is rushing out to buy the technology.” Despite the cost benefits and the simplified architecture enabled by depending on a single repository, the myriad governance issues associated with Data Lakes inevitably affect performance and data quality, which can defeat the purpose of utilizing data-driven processes in the first place.
Initially, Data Lakes existed for Data Scientists, and functioned as a means for these professionals to evaluate new data types and sources and create the degree of Metadata and semantic uniformity that is necessary for such semi-structured and unstructured data to follow governance conventions. The burgeoning popularity of Hadoop largely contributed to the concept of extending the functionality of Data Lakes throughout the enterprise, mostly because of its accommodation for Big Data, its enormous scalability, and the fact that myriad vendors helped to reinforce its status as a de facto Big Data platform by creating numerous tools and platforms to enhance its capabilities. These include databases that allow SQL and SQL-like querying with Hadoop, as well as options to expedite analytics into real time and eschew Hadoop’s batch-oriented processes.
The future of Data Lakes depends on their capacity to reinforce governance and address inherent security issues. An article from Forbes describes the evolution of Data Lakes in four phases. The current phase, which involves Data Lakes after Hadoop’s introduction to the data sphere, emphasizes the need for dedicated governance measures pertaining to hallmarks such as Metadata Management, semantic uniformity, and improved security. Yet these measures are discussed in a future context and, along with their relevance to Cloud applications, represent the final frontier for Data Lakes. Forbes observed that:
“In reality, many organizations are just starting to kick the tires of Hadoop. Of those enterprises who are using Hadoop, most are in the early stages of this process…with a few front-runners…Those organizations are big enough to face and invest in solutions to challenges that the vendors haven’t yet stepped up to, such as managing provenance, data discovery, and fine-grained security.”
Until then, the utility provided by Data Lakes may be outweighed by their complications to Data Governance.