In theory, the governance of Big Data is not much different from that of conventional data. Big Data arrives in greater quantities and is ingested more rapidly, so governing it effectively requires more effort and resources. But the principal goals of governance remain unchanged: to establish formal accountability for who has access to data and how it is defined, and to increase its reliability so that it enhances business and operational processes.
In practice, however, a number of challenges associated with the governance of conventional data become magnified when applied to Big Data. Foremost among these are privacy issues and the regulations governing public sources of Big Data such as social media and the Internet, which seem to be increasing (or at least changing) almost daily.
At the enormous speed and volume of data processed with Big Data technologies, the slightest discrepancy between expectation and performance can exacerbate quality issues, many of which may be compounded by Metadata complications when devising definitions for unstructured and semi-structured data. Other implications pertain to stewardship and information lifecycle management, and there is also the concern that Big Data Governance minimizes risk at the expense of agility.
Finally, it is important to realize that effective Big Data Governance implies integrating Big Data with conventional data, which presents challenges for organizations that may be using Hadoop as an ungoverned silo. Aligning Big Data Governance with that of conventional data requires extending the standards and definitions of the latter to Metadata about the former, which is not always easy.
Definitions and Metadata
Aligning Big Data with traditional data based on common definitions and Metadata is one of the essential first steps towards achieving data integration. However, Big Data sources such as clickstream, NoSQL stores, sensor data, and others will inevitably require a set of definitions that may be unique to those data sets, but which should ideally be aligned with the standards that apply to conventional data. Consistent definitions can be applied through Metadata, which may present difficulties for organizations that have traditionally used relational databases and now have to incorporate unstructured or semi-structured data into their Metadata.
One of the most useful ways to arrive at consistent nomenclature for data is through ontologies and Semantic technology. These technologies reduce data to a simple description of what it is and how it relates to other data, so that it can serve as reference data. By using Semantics to classify data, users can readily compare Big Data attributes to those of conventional data and categorize them accordingly. Semantic technology can also reduce all data types (whether Big Data or traditional, clickstream data or site blogs) to these descriptions, which can then be used as reference data to ensure that common Metadata is used for standardized governance.
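As a rough illustration of using shared descriptions as reference data, the sketch below maps source-specific field names onto a small common vocabulary. It is a simplified stand-in for a real ontology, not an implementation of any particular Semantic product; the vocabulary, the field names, and the helper functions are all hypothetical.

```python
# Hypothetical shared vocabulary: each common term lists the source-specific
# field names that are assumed to mean the same thing.
SHARED_TERMS = {
    "customer_id": {"cust_id", "userId", "visitor_id"},
    "event_time": {"order_date", "ts", "timestamp"},
}


def build_lookup(shared_terms):
    """Invert the vocabulary into a source-field -> shared-term lookup."""
    lookup = {}
    for term, aliases in shared_terms.items():
        for alias in aliases:
            lookup[alias.lower()] = term
    return lookup


def align_record(record, lookup):
    """Rename known fields to shared terms; collect unknown fields for review."""
    aligned, unmapped = {}, []
    for field, value in record.items():
        term = lookup.get(field.lower())
        if term is None:
            unmapped.append(field)  # no agreed definition yet
        else:
            aligned[term] = value
    return aligned, unmapped


lookup = build_lookup(SHARED_TERMS)

# Illustrative records: one relational row and one clickstream event.
warehouse_row = {"cust_id": 42, "order_date": "2014-03-01"}
clickstream_event = {"userId": 42, "ts": "2014-03-01T10:15:00", "referrer": "search"}

print(align_record(warehouse_row, lookup))
print(align_record(clickstream_event, lookup))
```

Fields that have no entry in the vocabulary (here, `referrer`) are returned separately rather than silently dropped, so that someone responsible for definitions can decide whether they need a common term.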
While it may be true that Big Data quality is significantly increased by applying the proper Metadata and standards used for conventional forms of data, the speed and size of data ingested by Big Data technologies still make quality issues one of the more prominent problems associated with this technology. This is especially the case with streaming data, although it applies to virtually every other form of Big Data as well. The crux of the matter is that with so much data coming in so fast, users would have to dedicate a substantial amount of resources to cleansing the data before they can analyze it, which may not even be worth it, depending on what sort of data is needed and how readily organizations can identify it. According to Forrester’s Michele Goetz:
“The biggest challenge I see many data professionals face is getting lost in all the data due to the fact that they need to remove risk to the business caused by bad data. In the world of big data, clearly you are not going to be able to cleanse all that data. A best practice is to identify critical data elements that have the most impact on business and focus efforts there.”
Factors that determine whether a particular source of Big Data is worth the effort of cleansing include its utility, its risk, and the amount of time and resources the cleansing would require. A good rule of thumb is to cleanse Big Data before deriving action from it; otherwise, simply treat it as an unsubstantiated indicator.
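The factors above can be turned into a simple triage sketch. The 1 to 5 scales, the scoring formula (utility plus risk minus effort), and the source names are assumptions made purely for illustration; an organization would substitute its own weighting.

```python
from dataclasses import dataclass


@dataclass
class SourceAssessment:
    name: str
    utility: int  # 1 (low) to 5 (high): value to the business
    risk: int     # 1 to 5: damage that bad data from this source could cause
    effort: int   # 1 to 5: time and resources needed to cleanse it


def cleansing_priority(source):
    """Assumed score: high utility and risk raise priority, effort lowers it."""
    return source.utility + source.risk - source.effort


def rank_for_cleansing(sources):
    """Order sources so the most worthwhile cleansing work comes first."""
    return sorted(sources, key=cleansing_priority, reverse=True)


def handling(cleansed, drives_action):
    """Apply the rule of thumb: cleanse before acting, else treat as a hint."""
    if cleansed:
        return "usable for decisions"
    if drives_action:
        return "cleanse before acting"
    return "unsubstantiated indicator"


# Hypothetical sources and scores.
sources = [
    SourceAssessment("social_mentions", utility=2, risk=1, effort=4),
    SourceAssessment("payment_events", utility=5, risk=5, effort=3),
]

ranked = [s.name for s in rank_for_cleansing(sources)]
print(ranked)
print(handling(cleansed=False, drives_action=False))
```

Integer scales keep the ranking easy to audit; the point is only that cleansing effort goes first to the data elements with the most business impact.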
Stewardship issues surrounding Big Data bring the age-old conflict between business and IT to the forefront, for the simple reason that many Big Data sources can readily influence business decisions in close to real time, while Data Stewards have traditionally come from the IT realm. Many organizations have designated a “Customer Steward” who is in charge of monitoring customer feedback via social media and other Internet channels. It may not be necessary to create a position dedicated exclusively to this task, although doing so presents the benefit of having an individual who not only presides over Big Data, but is also familiar with the varying issues of privacy that pertain to it. Some organizations may simply choose to incorporate Big Data Stewardship tasks into the workload of other stewards, which may or may not be feasible depending on their other responsibilities.
One of the best solutions, however, is to incorporate stewardship into the roles of the business users who require the data most. The most formidable aspect of this task is staying abreast of the myriad privacy regulations pertaining to social media and specific sites. But it gives the business the greatest access to and authority over the usage of Big Data in its quest to create action from data.
Big Data has several implications for privacy, an area with highly specific regulations that must be adhered to. In the world of social media these regulations vary by site, and pertain to what sort of personally identifiable information (such as phone numbers, addresses, emails, etc.) site participants can legally record and keep regarding one another. Additionally, there are regulations at the state and county level (many of which have been recently passed or are in various stages of passing) regarding what sorts of information one can obtain about others through social media and other forms of Big Data, and for how long it can be kept.
Sensor data, such as that utilized by insurance companies to provide information about a customer’s driving habits and location, also has privacy ramifications regarding how an insurance company uses and deploys such data. With violations potentially resulting in lawsuits, privacy regulations must be studied and accounted for in a Governance Program for Big Data much more stringently than for conventional data. Doing so is usually part of the corresponding data steward’s responsibility.
Outside of regulations expressly for Big Data, lifecycle management concerns for Big Data are fairly similar to those for conventional data. One of the biggest differences, of course, is in providing ample resources for data storage that account for the annual growth rate. Different departments will need access to data for different lengths of time, which factors into how long data is kept. Lifecycle principles are inherently related to data quality issues as well, since such data is only truly accurate once it has been cleaned and tested for quality. As with conventional data, lifecycle management for Big Data is also industry specific and must adhere to the relevant external regulations.
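To make the storage point concrete, the sketch below projects capacity needs under a constant annual growth rate using compound growth, current × (1 + rate)^years. The constant-rate assumption and all of the figures are illustrative only; real growth is rarely this regular.

```python
def projected_storage_tb(current_tb, annual_growth_rate, years):
    """Storage needed after `years`, assuming a constant compound growth rate."""
    return current_tb * (1 + annual_growth_rate) ** years


def years_until_full(current_tb, annual_growth_rate, capacity_tb):
    """Whole years until projected storage first exceeds the given capacity."""
    if annual_growth_rate <= 0:
        raise ValueError("annual_growth_rate must be positive")
    years = 0
    needed = current_tb
    while needed <= capacity_tb:
        years += 1
        needed = projected_storage_tb(current_tb, annual_growth_rate, years)
    return years


# Illustrative figures: 100 TB today, growing 50% per year.
print(projected_storage_tb(100, 0.5, 2))   # 225.0
print(years_until_full(100, 0.5, 300))     # 3
```

Retention periods interact with this projection: the longer departments need to keep data, the less the stored volume is offset by deletion or archiving.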
Agility or Risk?
The point of any Data Governance Program is to establish formal accountability for data, its access, and how and where it is stored. With most Governance Programs enforced by a lengthy set of policies, governance councils, and a multitude of stewards, their principal benefit is to reduce the risk associated with having and relying on data to influence decision making. However, such a seemingly rigid structure runs counter to agility, which can immensely increase data’s effect on business and operations.
All of the aforementioned implications for Big Data would be meaningless if they inhibited the natural process by which Data Management is increasingly refined over time. While Big Data Governance Programs will certainly require the structure and formality associated with conventional governance, the use of Semantics in definitions and Metadata, as well as the employment of business (or operations) personnel as stewards, should ideally offset some of that rigidity with the much-needed agility to account for changing regulations and requirements for lifecycle management and privacy.