The implementation of effective Data Governance is certainly complicated by Big Data. According to a recent report from Gartner, “Any organization thinking of simply applying existing information governance practices to big data will likely fail.”
Whereas conventional Data Governance programs are based on creating rules and definitions for proprietary data that has been accumulated specifically for its value to business and operations, Big Data Governance has to account for unknown sources, structure, and uses of data.
Big Data Governance principles, therefore, pertain as much to data that is cleansed and related to organizational processes as to data that is not. They go beyond simply controlling data to enabling exploration, so that uses for business and operations that may not be apparent can be determined.
The distinction, then, between conventional Data Governance and Big Data Governance is a shift in focus: the former is about enforcing conformity to data standards, while the latter is about mitigating risk and increasing trust to extrapolate value from Big Data.
Reducing Risk: Unwanted Data
One of the key differences between governing Big Data and internal data is that Big Data Governance applies both to data that has immediate value to the enterprise and to data that does not. With conventional internal data, reducing risk simply means discarding data that does not immediately fit into a preconfigured structure or purpose for a specific organizational process.
The variation and quantity of Big Data, however, require more time to determine useful relationships between Big Data and proprietary data. Less accessible data is not always less valuable; organizations are compelled to retain and govern such data until its value is determined. Facets of data quality that are usually easily determined must be adjusted in the case of Big Data Governance to allow sufficient time for exploration.
Adjusting Data Quality
A multitude of factors are considered when determining Data Quality, including accuracy, completeness, and timeliness. Conventional Data Quality tools and the stewards involved in assuring quality typically analyze these factors from a simplified dichotomy: the data either meets standards for a specific business or operations process, or it does not.
Quality standards for Big Data Governance – encompassing both tools and stewardship responsibility – are based on the same factors without the simplified dichotomy. Since the relationship a particular form of Big Data bears to an organization’s objectives may not yet be determined, the standards by which Big Data passes or fails quality control need substantial revision. Doing so requires recalibrating tools and possibly reconstructing policy regarding the stringency of quality.
As advantageous as it may seem to make adjustments to Data Quality policies prior to implementing a Big Data initiative, organizations should remember they won’t fully know the various structures of data (and their quality ramifications) until the data is received. Subsequently, adjustments should be made both before and after launching the initiative.
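The shift away from a pass/fail dichotomy can be sketched in code. The following is a minimal illustration – the factor weights, thresholds, and three-way triage categories are assumptions for illustration, not a prescribed standard:

```python
# Illustrative sketch: replacing a binary pass/fail quality gate with a
# graded score, so data that fails conventional standards can still be
# retained for exploration. Weights and thresholds are assumptions.

def quality_score(stats):
    """Combine common quality factors into a single 0-1 score."""
    weights = {"accuracy": 0.4, "completeness": 0.35, "timeliness": 0.25}
    return sum(weights[f] * stats.get(f, 0.0) for f in weights)

def triage(stats, production_threshold=0.8, discard_threshold=0.3):
    """Three-way triage instead of the conventional pass/fail dichotomy."""
    score = quality_score(stats)
    if score >= production_threshold:
        return "production-ready"
    if score >= discard_threshold:
        return "retain-for-exploration"  # value not yet determined
    return "discard"

print(triage({"accuracy": 0.9, "completeness": 0.95, "timeliness": 0.8}))
print(triage({"accuracy": 0.5, "completeness": 0.4, "timeliness": 0.6}))
```

The middle tier is the point: data that a conventional gate would reject is held for exploration until its value is determined.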
Reducing Regulatory Risk
Another key aspect of risk reduction for Big Data directly relates to the governance of unwanted or unexplored data. Regulatory concerns apply to “useful” data as well as to data whose use has yet to be determined. Governance policies regarding regulations of data (which may vary by source, physical location, and industry) have to extend to unused data as well, especially since such data may be of immediate value to customers, competitors and, most alarmingly, to regulatory agencies.
Prudent organizations should analyze this concern before implementing a Big Data initiative and structure this facet of governance accordingly. As is the case with certain types of proprietary data, organizations may need to mask personally identifiable or sensitive information.
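Masking can be sketched as follows. This is an illustrative approach only – the field names, the choice of salted one-way hashing, and the truncation length are assumptions, and a real program would follow the organization’s regulatory requirements:

```python
# Illustrative sketch: masking personally identifiable fields before
# ungoverned Big Data is retained for exploration. Field names and the
# salted-hash approach are assumptions, not a prescribed standard.
import hashlib

PII_FIELDS = {"name", "email", "ssn"}

def mask_record(record, salt="replace-with-a-secret-salt"):
    """Return a copy with PII fields replaced by one-way hashes.

    Equal values hash alike, so records remain joinable on masked
    fields without exposing the underlying values."""
    masked = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            digest = hashlib.sha256((salt + str(value)).encode()).hexdigest()
            masked[key] = digest[:12]  # truncated for readability
        else:
            masked[key] = value
    return masked

original = {"name": "Ada Lovelace", "email": "ada@example.com", "region": "EU"}
print(mask_record(original))
```

Deterministic masking preserves analytic utility (counts, joins, deduplication) while keeping sensitive values out of exploratory environments.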
Increasing Trust: Rating Data
One of the key ways to foster trust in Big Data – especially data from new sources or with unconventional structures – is to develop a system in which it is rated in terms of the user’s level of confidence. Data should be rated according to how recent it is, its level of quality, and the credibility of its source. Additional factors may include the data’s utility for a particular facet or organizational function, so that context also factors into the reliability of a particular source.
Ratings should ideally be quantitative and provide an accurate depiction of the reliability of data at a brief glance. The greater the trust rating, the lower the risk in using the data. It may also be useful to rate proprietary data or data from internal sources the same way, to provide quick comparisons between these and Big Data.
The point of rating data is to evaluate and increase trust. With the rapidity and variation of the massive quantities of Big Data, however, the focus of such ratings will inevitably be on the source of the data, as opposed to the actual data (which is what conventional governance focuses on). Sources should be evaluated in terms of their track record of success (which requires an additional aspect of governance for keeping track of accuracy), as well as their alignment with confirmed or proprietary sources. Factors such as the frequency of use and number of users of a Big Data source also feed into its rating and may be worth quantifying.
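A quantitative source rating along the lines described above might look like the following sketch. The factor names, weights, and 0–100 scale are all assumptions chosen for illustration:

```python
# Illustrative sketch: a quantitative trust rating for a Big Data source,
# combining the factors discussed above. Weights and scale are assumptions.

def source_trust_rating(source):
    """Score a source 0-100 from factors supplied as 0-1 values."""
    weights = {
        "recency": 0.15,       # how current the data is
        "quality": 0.25,       # measured quality of past deliveries
        "credibility": 0.25,   # reputation of the source
        "track_record": 0.20,  # past accuracy, per governance logs
        "usage": 0.15,         # frequency of use / number of users
    }
    score = sum(weights[k] * source.get(k, 0.0) for k in weights)
    return round(score * 100)

# Hypothetical external source for illustration.
clickstream = {"recency": 0.9, "quality": 0.6, "credibility": 0.7,
               "track_record": 0.5, "usage": 0.8}
print(source_trust_rating(clickstream))
```

A single number lets users compare an unfamiliar external feed against internal sources at a glance: the higher the rating, the lower the risk.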
The Importance of Metadata
Big Data Governance relies on Metadata perhaps even more than conventional Data Governance does. There is a parallel between Big Data Governance’s increased reliance on Metadata and its increased reliance on data sources (rather than the data itself), for the simple fact that the tremendous influx of Big Data requires more resources for sorting and stratifying data. Given the massive amounts of Big Data, virtually any information about it, such as Metadata, can assist organizations in determining its use – especially when that information can be gleaned relatively quickly, as opposed to waiting for data scientists to dissect it in sandboxes. Semantic technologies may provide the most consistent means of identifying Metadata across various structures.
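Gleaning structural Metadata quickly, before full exploration, can be as simple as profiling a sample of records. The following sketch is one assumed approach – the field names and summary statistics are illustrative:

```python
# Illustrative sketch: quickly profiling a sample of unfamiliar records to
# glean structural metadata (fields, inferred types, fill rates) before
# deeper exploration. The chosen statistics are assumptions.
from collections import Counter

def profile_metadata(records):
    """Summarize field presence and inferred types across a record sample."""
    total = len(records)
    fields = {}
    for rec in records:
        for key, value in rec.items():
            entry = fields.setdefault(key, {"count": 0, "types": Counter()})
            entry["count"] += 1
            entry["types"][type(value).__name__] += 1
    return {
        key: {
            "fill_rate": entry["count"] / total,
            "dominant_type": entry["types"].most_common(1)[0][0],
        }
        for key, entry in fields.items()
    }

sample = [{"id": 1, "ts": "2014-01-01"}, {"id": 2},
          {"id": "3", "ts": "2014-01-02"}]
print(profile_metadata(sample))
```

Even this coarse profile (which fields exist, how often they are populated, what types they hold) helps stratify an unfamiliar source well before a full sandbox analysis.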
Conventional Data Governance policies towards data cleansing revolve around the concept of ridding data of errors so that it is “correct.” The ambiguities of Big Data and its relationship to internal data and organizational processes, of course, make this concept somewhat contradictory. The initial focus of Big Data Governance policies for cleansing should be on exploration of the data’s attributes and uses, until the structure of a particular source becomes familiar. Cleansing should take place afterwards. It may be more advantageous to do so at the application level, so that users can readily compare such data with data that has already been cleansed, to assist in the process.
Conventional Data Governance essentially strives for pure or perfect data, whereas Big Data Governance largely exists to determine what data has the least amount of risk and yields the greatest reward. The key takeaway from this article is that Big Data Governance requires an adjustment of conventional policies and practices so that the untapped potential of Big Data is not circumscribed by governance.
Additionally, governance principles apply to both used and unused data, the latter of which should not be discarded simply because it does not adhere to conventional data quality standards. The objective of Big Data Governance is to increase confidence in Big Data’s use so that it can successfully augment internal data and influence decision making. As such, there is an increased reliance on instantly identifiable factors such as Metadata and data sources, which play key roles in establishing data quality and decreasing risk.