Michele Goetz at Forrester recently wrote that “…Big Data is creating data swamps even in the best intentioned efforts…Analysts will still spend up to 80% of their time just trying to create the data set to draw insights.” Implicit in this statement is the reality that whether organizations are attempting to leverage data lakes (such as those built on Hadoop) or other means of storing and accessing Big Data, the sheer quantity and variety of those data can overwhelm any organization without stringent Big Data Governance practices already in place.
Moreover, the multitude of self-service tools can encourage analysts to perform analytics in near real time to seize emerging opportunities while forsaking such governance hallmarks as Data Quality, cleansing, lifecycle management, and regulatory conformity.
Organizations can tremendously increase the efficacy of Big Data Governance by focusing on the data preparation process, which can yield more expedient and reliable access to data. Forrester indicated that this process and a set of newfound tools to facilitate it “are the catalyst to bringing trust, speed, and actionable insight for all data where traditional data governance and management tools have hit the wall.”
Data Preparation
Enterprises can avoid the dreaded data swamp phenomenon by sufficiently preparing for the unstructured, semi-structured, and schema-less nature of Big Data through a focus on data wrangling. Wrangling (the taming of unstructured data) is a critical component of data preparation, which Computer Weekly defines as “the manipulation and transformation of data, from its raw core, into a form suitable for analysis and processing.” Conducting this process in a manner that is consistent, timely, sustainable, and congruent with standard names and definitions that extend across the enterprise is a critical component of governance, particularly for Big Data. The myriad functions of the data preparation process include the following (a brief sketch of several of them follows the list):
- Ingestion
- Integration
- Aggregation
- Cleansing
- Enrichment/Enhancement
- Structure provisioning
- Sequencing of data transformations
- Standardization
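To make these functions concrete, here is a minimal sketch of a repeatable preparation sequence in Python with pandas, chaining cleansing, standardization, enrichment, and aggregation. The column names (customer_id, cust_nm, order_date, amount) are hypothetical stand-ins for whatever a given source actually supplies.

```python
import pandas as pd

def prepare(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply a fixed, repeatable sequence of preparation steps."""
    df = raw.copy()

    # Cleansing: drop exact duplicates and rows missing the key field
    df = df.drop_duplicates().dropna(subset=["customer_id"])

    # Standardization: enforce canonical names, types, and casing
    df = df.rename(columns={"cust_nm": "customer_name"})
    df["customer_name"] = df["customer_name"].str.strip().str.title()
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

    # Enrichment: derive an attribute downstream consumers need
    df["order_year"] = df["order_date"].dt.year

    # Aggregation: roll up to the grain the business consumes
    return df.groupby(["customer_id", "order_year"], as_index=False).agg(
        total_spend=("amount", "sum"),
        order_count=("amount", "count"),
    )
```

Encapsulating the sequence in one function is what makes it consistent and sustainable: every analyst who calls prepare() gets the same names, types, and grain.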
Data preparation sits at the intersection of a number of critical facets of Data Management. Not only is its implementation critical for Big Data Governance, but it is also necessary for the proper provisioning of analytics. There is an element of data discovery inherent in the preparation process, in which the type, quality, and values of the data are discerned so that the data fit into organizational processes. In this respect, data preparation is sometimes considered a realm of Data Science.
Automation Tools
The normalization of Big Data across vertical industries has produced a burgeoning market for automated data preparation tools, which expedite the preparation of Big Data and optimize it for business or analytics consumption. The need for such tools stems from the fact that, as indicated by Oracle, as much as “90% of the time in Big Data projects is really spent on data preparation.” Many of the steps required to prepare smaller data sets become considerably more difficult at the volume and variety of Big Data. For instance, it is necessary to begin by selecting data for a particular business function; with Big Data, it is prudent to start by governing the source of data before scrutinizing individual elements. It is then necessary to determine relationships between data and data sources, structured or otherwise, before extracting and organizing the data to load into any variety of platforms. Recent offerings from vendors such as Paxata, Oracle, SAS, and others can significantly enhance this process with predictive analytics that make it more intuitive and expeditious. A number of conventional Data Governance vendors also have platforms that address aspects of this process. Without adhering to it in some form, however, an organization runs the risk of creating a disorganized data swamp that defies governance protocols.
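To illustrate the relationship-discovery step, the sketch below uses a deliberately simple heuristic: column pairs from two sources whose sampled values overlap heavily are flagged as candidate join keys. Commercial tools rely on far richer statistics; the 0.5 threshold and the crm/billing samples here are arbitrary assumptions.

```python
def jaccard(a, b):
    """Overlap between two sets of column values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def candidate_join_keys(left: dict, right: dict, threshold: float = 0.5):
    """Suggest column pairs whose value sets overlap enough to be join keys.

    `left` and `right` map column names to lists of values sampled
    from two different sources.
    """
    return sorted(
        ((lc, rc, score)
         for lc, lv in left.items()
         for rc, rv in right.items()
         if (score := jaccard(lv, rv)) >= threshold),
        key=lambda t: -t[2],
    )

# Hypothetical samples from two ingested sources
crm = {"cust_id": [1, 2, 3, 4], "region": ["EU", "US", "US", "APAC"]}
billing = {"account": [2, 3, 4, 5], "currency": ["USD", "EUR", "USD", "USD"]}
print(candidate_join_keys(crm, billing))  # [('cust_id', 'account', 0.6)]
```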
On-the-Fly Machine Learning
Automated data preparation tools can function as the initial point of Big Data Governance by accommodating the unstructured, schema-free orientation of Big Data with Machine Learning algorithms. This subset of predictive analytics not only makes the data preparation process more accessible, but also incorporates an element of self-service into this critical aspect of governance. Machine Learning capabilities enable solutions to discern statistical similarities in disparate data and to leverage them to create schema based on the data themselves rather than some pre-defined model. Some offerings combine Machine Learning with Natural Language Processing to achieve a similar effect via text mining. Additionally, these algorithms produce recommendations about transformation methods and shortcuts for adding structure to data. Frequently, such tools are also equipped with Data Visualization technologies designed to make the preparation process intuitive, and are accessed either on-premises or through the Cloud for use with popular Big Data platforms such as Hadoop.
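As a miniature illustration of deriving schema from the data themselves, the sketch below profiles a sample of schema-less records and lets the dominant observed type of each field win a majority vote. This is a toy heuristic, not any vendor’s actual algorithm; real offerings apply genuine Machine Learning to the same end.

```python
from collections import Counter

def infer_schema(records, sample_size=1000):
    """Infer a schema from schema-less records by majority-vote typing."""
    seen = {}
    for rec in records[:sample_size]:
        for field, value in rec.items():
            seen.setdefault(field, Counter())[type(value).__name__] += 1
    # The dominant observed type wins; ties go to the type seen first
    return {field: counts.most_common(1)[0][0] for field, counts in seen.items()}

records = [
    {"id": 1, "name": "Ann", "score": 9.5},
    {"id": 2, "name": "Bo"},                # missing field: casts no vote
    {"id": "3", "name": "Cy", "score": 7},  # inconsistent types, as in real feeds
]
print(infer_schema(records))  # {'id': 'int', 'name': 'str', 'score': 'float'}
```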
Democratization of Governance
By increasingly involving laymen and business users in the data preparation process, these automated offerings help to democratize Big Data Governance. The self-service nature of this process makes ‘citizen’ stewards out of those who do not otherwise expressly have a role in Data Governance. Moreover, these tools help to ingrain governance principles within the business, which is the primary user and (purportedly) beneficiary of both data-driven practices and Data Governance. Users gain the additional benefit of being able to prepare data in a way that serves their own use cases while remaining consistent with enterprise-wide governance conceptions. Thus, Big Data is loaded into data lakes or other platforms, and made accessible, in a way that adheres to governance while still offering the boon of expeditious access for analytics.
Metadata
The incorporation of metadata into data lakes and Big Data stores is essential to prevent them from sprawling into data swamps. Numerous vendors offer tools and platforms specifically designed to add metadata to common Big Data hubs such as Hadoop. Frequently, upstream data sources already provide metadata, so organizations are not necessarily burdened with creating metadata specifically for their data lakes. The trick, however, is to standardize that metadata to provide uniform recognition of data elements, or to leverage that uniformity so that access can be restricted across business units or by use case, per governance mandates.
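The sketch below shows one way such standardization might look in practice: mapping heterogeneous upstream metadata keys onto an enterprise vocabulary and deriving an access restriction from the result. The key mappings, the finance-steward rule, and the field names are all hypothetical.

```python
# Hypothetical mapping from upstream metadata keys to the enterprise standard
STANDARD_KEYS = {
    "src": "source_system",
    "source": "source_system",
    "dt": "ingest_date",
    "load_date": "ingest_date",
    "owner": "data_steward",
}

SENSITIVE_STEWARDS = {"finance"}  # units whose data is access-restricted

def standardize_metadata(upstream: dict) -> dict:
    """Normalize upstream metadata keys and derive an access flag."""
    meta = {STANDARD_KEYS.get(k.lower(), k.lower()): v for k, v in upstream.items()}
    # Governance mandate: restrict access based on the standardized steward field
    meta["restricted"] = meta.get("data_steward") in SENSITIVE_STEWARDS
    return meta

print(standardize_metadata({"SRC": "billing", "Load_Date": "2016-03-01", "Owner": "finance"}))
# {'source_system': 'billing', 'ingest_date': '2016-03-01',
#  'data_steward': 'finance', 'restricted': True}
```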
Data Ratings
Providing ratings of data elements based on a handful of factors is an integral aspect of including metadata in data lakes. Although doing so is not a substitute for formal Data Quality measures, rating data based on its metadata and on the value individual elements have provided to a particular source system can add another layer of clarity to Big Data stores. It may be advantageous to add metadata that provides informal, quantitative measures of that value, as well as metadata indicating the processes for which specific data elements were used. Another option is for each user to add a metadata rating of an element after use, in which case it may be necessary to average those ratings into a composite one.
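Here is a minimal sketch of that composite scheme, assuming a numeric source-system rating blended with user ratings collected after use; the 50/50 weighting is an arbitrary starting point, not a standard.

```python
from statistics import mean

def composite_rating(system_rating: float, user_ratings: list[float],
                     system_weight: float = 0.5) -> float:
    """Blend a source-system rating with the average of user ratings.

    With no user ratings yet, the system rating stands alone.
    """
    if not user_ratings:
        return system_rating
    return system_weight * system_rating + (1 - system_weight) * mean(user_ratings)

# An element rated 4.0 by its source system, then by three users after use
print(composite_rating(4.0, [3.0, 5.0, 4.0]))  # 4.0
```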
Effective Big Data Governance
Organizations can avoid data swamps by applying due diligence to the data preparation process. Whether or not they take advantage of the myriad preparation tools for Big Data or those for conventional Data Governance, scrutinizing this process can improve not only governance but also various facets of analytics and Data Science. The preparation process is considerably augmented by the incorporation of metadata and Metadata Management into data lakes, as well as by rating data elements according to their value in their initial sources and according to users.