The perceived benefits of using a single repository (typically Hadoop or another NoSQL platform) to store and access all data, structured or unstructured and regardless of schema, as a Data Lake are typically touted as:
- Universal enterprise access: Data Lakes enable all users to quickly access data from a single place, without time-consuming and rigorous modeling constraints, which purportedly aids data lineage.
- Increased agility and speed: The enterprise-wide accessibility of Data Lakes is believed to increase agility, partly because data is moved into the lake in its native format, ready for comparison with other types of data.
- Reduced infrastructure costs: Relying on a single repository for data access reduces the costs associated with physical infrastructure.
However, as explained in a previous article, those same benefits pose substantial challenges to Metadata, semantic consistency, and security and risk, all of which proper Big Data Governance is designed to uniformly manage and control.
This article provides solutions to those governance complications, solutions that can strike a delicate balance between the drawbacks and benefits of Data Lakes.
Organizations that utilize Data Lakes should be aware of these three caveats:
- History: Historically, Data Lakes were used by Data Scientists to analyze new data types and determine which of their attributes were relevant to the enterprise. As relatively small, specialized tools for these employees, Data Lakes avoided the governance complications that arise when they are extended throughout the enterprise.
- Vendors: According to Gartner: “Several vendors are marketing data lakes as an essential component to capitalize on big data opportunities, but there is little agreement between vendors about what constitutes a data lake, or how to get value from it” (Heudecker and White, 2014). A further concern with the vendor solutions discussed in this article is vendor lock-in, in which organizations become tied to a proprietary technology that may have substantial Metadata repercussions.
- Snapshots: The solutions discussed in this article reflect the technology available as of this writing. Vendors are continuously developing solutions to address ongoing Data Lake issues, so the specifics are subject to change.
Organizations that utilize Data Lakes must account for Metadata, which provides the foundation for proper governance by solidifying the relevant attributes of data and classifying them consistently. Addressing the Metadata inconsistency associated with Data Lakes can involve both informal, manual approaches and formal, automated ones. The former is largely based on usage patterns and the queries users issue, and is far less effective than the latter: use cases for particular data are unlikely to be enterprise-wide, which can produce a fragmented approach to Metadata that varies by department. Nonetheless, this method allows some degree of Metadata Management, although it is probably insufficient over the long term. An IDC article discussing usage-based Metadata states:
“Through various processes, metadata is created. These processes…can be informal, based on accumulated data queries. And some may be based on pattern recognition. Based on the metadata and search histories, references to data sets can be created. Some systems can be defined to create informal schemes, formats and subsets based on these data sets.”
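The informal approach the IDC quote describes can be sketched in a few lines: metadata inferred after the fact from accumulated query history. The query-log format and field names below are assumptions for illustration, not any particular product's schema.

```python
from collections import Counter

# Hypothetical accumulated query log; real systems would read this
# from the lake's query audit trail.
query_log = [
    {"user": "ana", "dataset": "clickstream", "fields": ["user_id", "url"]},
    {"user": "ben", "dataset": "clickstream", "fields": ["user_id", "ts"]},
    {"user": "ana", "dataset": "orders", "fields": ["order_id"]},
]

def infer_usage_metadata(log):
    """Derive rough, usage-based metadata: which fields of each
    dataset users actually touch, ranked by frequency."""
    usage = {}
    for query in log:
        counts = usage.setdefault(query["dataset"], Counter())
        counts.update(query["fields"])
    return {ds: counts.most_common() for ds, counts in usage.items()}
```

Because the inferred attributes depend entirely on who happens to query what, two departments with different query habits would derive different "metadata" for the same dataset, which is exactly the fragmentation risk noted above.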
A more rigorous method of applying Metadata to data in Data Lakes involves automating that process through any variety of vendor solutions. Many of these involve semantics; the real-time applicability of these products is likely to vary by vendor and may be inconsistent. Several of these options are specifically tailored for Hadoop, while there are some that apply to NoSQL stores as well. In general, most of these apply a Metadata layer that enables organizations to input and adjust their Metadata Registry (based on critical business requirements) so that certain aspects of Metadata are automated as data is ingested into the lake.
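The formal approach can be pictured as a registry of business-defined tagging rules applied to every record on ingestion. The sketch below is a minimal illustration of that pattern; the class and rule names are hypothetical, and commercial Metadata layers expose this as configuration rather than code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MetadataRegistry:
    """Illustrative registry: business requirements become tagging rules
    that run automatically as data enters the lake."""
    rules: dict = field(default_factory=dict)  # attribute name -> rule function

    def register(self, attribute, rule):
        self.rules[attribute] = rule

    def tag(self, record):
        # Every record is stamped with ingestion time plus each registered attribute.
        metadata = {"ingested_at": datetime.now(timezone.utc).isoformat()}
        for attribute, rule in self.rules.items():
            metadata[attribute] = rule(record)
        return {"data": record, "metadata": metadata}

registry = MetadataRegistry()
registry.register("source_system", lambda r: r.get("_source", "unknown"))
registry.register("contains_pii", lambda r: "ssn" in r or "email" in r)

tagged = registry.tag({"_source": "crm", "email": "a@example.com"})
```

Adjusting the registry (rather than each pipeline) is what lets governance requirements change without rewriting ingestion code.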
This process can be optimized by ensuring that data movement adheres to predefined workflows. As a Knowledgent blog recently noted, “if these workflows include steps like lineage recording and usage tracking, metadata capture could be guaranteed…” Automated Metadata solutions are equipped with APIs to facilitate integration between sources, frequently track data lineage and data transformation, and in certain instances provide functionality for advanced analytics. One example of such a solution is Revelytix’s Loom.
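The idea of guaranteed metadata capture through workflow steps can be sketched as a wrapper that no transformation is allowed to bypass. This is an assumption-laden toy, not Loom's or any vendor's actual API: the log structure and identifiers are invented for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone

lineage_log = []  # in a real deployment this would be a durable metadata store

def run_step(step_name, transform, dataset, input_id):
    """Run a workflow step and record its lineage as a side effect,
    so metadata capture cannot be skipped."""
    output = transform(dataset)
    # Content-derived ID lets downstream steps reference this exact output.
    output_id = hashlib.sha256(
        json.dumps(output, sort_keys=True).encode()
    ).hexdigest()[:12]
    lineage_log.append({
        "step": step_name,
        "input": input_id,
        "output": output_id,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return output, output_id

raw = [{"amount": "12.5"}, {"amount": "7"}]
cleaned, cid = run_step(
    "cast_amounts",
    lambda rows: [{"amount": float(r["amount"])} for r in rows],
    raw,
    "raw-001",
)
```

Because lineage is recorded inside the runner rather than by each pipeline author, every dataset in the lake arrives with a traceable history.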
The role of semantics in effecting Big Data Governance for Data Lakes sits between Metadata and Data Science. Semantics can identify data and derive meaning from it regardless of structure or the lack thereof; semantic technologies are partly responsible for automating Metadata and for familiarizing users with the meanings of certain data and data types. Knowledgent indicated that:
“Semantic technologies like domain ontologies can be used to not only capture physical metadata, but also to append business context to the data. This would allow end users to locate and purpose data to their needs without interpretation by data specialists.”
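A domain ontology of the kind Knowledgent describes can be reduced, for illustration, to a mapping from physical field names to business concepts. Real semantic technologies use RDF/OWL ontologies and query languages such as SPARQL; the dictionary and field names below are purely hypothetical.

```python
# Toy "ontology" appending business context to physical metadata
# (field names and concepts are invented for illustration).
ONTOLOGY = {
    "cust_nm": {"concept": "Customer Name", "domain": "Party"},
    "txn_amt": {"concept": "Transaction Amount", "domain": "Finance", "unit": "USD"},
}

def business_context(physical_field):
    """Translate a physical column name into its business meaning."""
    return ONTOLOGY.get(physical_field, {"concept": "unknown"})

def find_fields(concept_substring):
    """Let an end user locate data by business concept rather than
    by cryptic column name, without a data specialist's help."""
    return [
        field for field, meta in ONTOLOGY.items()
        if concept_substring.lower() in meta["concept"].lower()
    ]
```

The point of the second function is exactly the quote's claim: users search by meaning ("transaction") and land on the physical data (`txn_amt`) without interpretation by specialists.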
Yet another option for controlling the chaotic Data Governance situation that Data Lakes can present is to have Data Scientists curate them regularly. This option is certainly less practical for near real-time Big Data sets than for smaller, less time-sensitive data. Still, involving Data Scientists in monitoring and curating Data Lakes directly addresses the principal governance problem with these repositories: that they bypass Data Science and deliver data to end users in raw form. Data Scientists can add to their lengthy list of tasks by grooming Data Lakes, providing or ensuring semantic consistency, and blending one of the most vital aspects of their jobs (determining data attributes and their relevance to business needs) with governance responsibilities. How much of this they can do in a time frame useful to users depends on how many are deployed and on the specific use case.
Role-based access to data is a hallmark of effective Data Governance, and one of the points at which the field intersects with enterprise security. The concept takes on renewed emphasis with Data Lakes and Big Data Governance, since a defining characteristic of the former is that everyone accesses data from the same place. Vendors such as Hortonworks offer solutions specific to Data Lakes (many involving Hadoop, some supporting NoSQL stores) for limiting user access according to roles, providing auditing capabilities, and increasing security measures. In some instances, the semantics and Metadata processes described above can also tag data with security information as it enters the lake. Expect vendors to increase the specificity of role-based access in the near future, stratifying it by data type, use case, and department.
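At its core, role-based access with auditing reduces to a policy table plus a check that logs every decision. The sketch below is a deliberately minimal illustration; the role names and lake "zones" are assumptions, and products in this space (e.g., Hortonworks' Hadoop security tooling) enforce this through policies and plugins rather than application code.

```python
# Hypothetical role-to-zone policy for a Data Lake
# (role and zone names are invented for illustration).
ROLE_POLICIES = {
    "data_scientist": {"raw", "curated"},
    "analyst": {"curated"},
    "auditor": {"curated", "audit_log"},
}

audit_log = []  # every access decision is recorded for later review

def can_access(role, zone):
    """Check a role against the policy table and audit the decision,
    allowed or not."""
    allowed = zone in ROLE_POLICIES.get(role, set())
    audit_log.append({"role": role, "zone": zone, "allowed": allowed})
    return allowed
```

Stratifying access further by data type, use case, or department, as the article anticipates, would simply add dimensions to the policy table rather than change the checking logic.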
Although most vendors target their Data Lake options at Hadoop, a number of Big Data Governance concerns are specific to NoSQL stores, whether or not they are expressly used for Data Lakes. Because of the schemaless environment they offer and the ability to readily store unstructured and structured data in one place, these stores are frequently used to expedite application development, which Gartner indicates shifts the focus to “the drive to solve today’s problems without considering how data will be used in the long term” and subsequently “impacts the viability of data stored in NoSQL databases” (Heudecker and Friedman, 2013).
Additionally, several NoSQL stores (and even certain Hadoop deployments) lack the business logic and governance mechanisms that are easily enforced in the rigid data modeling of their relational counterparts. Organizations can account for this either by implementing business logic in individual applications in a decentralized manner (which may be time-consuming when requirements change) or in a common layer, such as via the Cloud. With the former approach, it is best to effect business logic during application development, not in production. Either method helps to balance the fleeting needs for data during app development with the long-term, reusable capabilities of well-governed data.
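The common-layer alternative can be sketched as a shared rule set that every application calls, instead of each app duplicating validation a relational schema would otherwise enforce. Entity and rule names below are invented for illustration; this is one minimal way to realize the pattern, not a prescribed design.

```python
# Shared business rules, defined once and consumed by every application,
# rather than re-implemented per app. Each rule returns True on success
# or an error message on failure. (Names are illustrative.)
RULES = {
    "order": [
        lambda o: o.get("quantity", 0) > 0 or "quantity must be positive",
        lambda o: o.get("customer_id") is not None or "customer_id required",
    ],
}

def validate(entity_type, record):
    """Apply the shared rules for an entity type; return all error messages."""
    errors = []
    for rule in RULES.get(entity_type, []):
        result = rule(record)
        if result is not True:
            errors.append(result)
    return errors
```

When requirements change, only `RULES` is edited: the decentralized alternative would require the same change in every application that writes to the store.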
Implementing the aforementioned solutions can help mitigate the Big Data Governance complications of Data Lakes while preserving their value. Ongoing developments in vendor capabilities should help users leverage semantics for Metadata and address security concerns, while the curation efforts of Data Scientists can smooth the process. But, like many facets of contemporary Data Management, the technologies and techniques for utilizing Data Lakes are still maturing.
Heudecker, N., & White, A. (2014). The Data Lake Fallacy: All Water and Little Substance. Gartner. www.gartner.com
Heudecker, N., & Friedman, T. (2013). Does Your NoSQL DBMS Result in Information Governance Debt? Gartner. www.gartner.com