There is no shortage of hype regarding the importance of Big Data initiatives in terms of analytics and the business value they can provide the enterprise. The reality, though, is that Big Data effectively forsakes conventional Data Governance and Data Quality in numerous ways, complicating both Big Data Governance and Big Data Quality. Some of the more prominent ways in which it does so include:
- At point of origin: The best example of this fact is sentiment analysis of social media data, in which references to customers, products, and services are made in any variety of ways that fall outside conventional Data Quality standards. According to Chris Martins, Product Marketing Manager of Trillium Software: “There are no policies and procedures about how you initiate that data. On some social media sites in which your customers are interacting, there’s no control over them forcing them to use their proper identity. There are no requirements that they use first names and last names, avoid nicknames, [and] use proper addresses. All of the things that are sort of proper Data Governance policies don’t really exist out there.”
- Data Lakes: The tendency to utilize Data Lakes as a single repository for storing and expeditiously accessing all data, regardless of typical governance conventions for metadata, schema, and Data Modeling, may prove valuable in the short term, but can degrade virtually any data-driven process in the long term.
- Immensity: The sheer volumes of Big Data, exacerbated by the ever-broadening number and variety of sources it facilitates, can overwhelm organizations without rigid Data Governance and Data Quality measures in place. Conversely, too strict an adherence to governance and quality can substantially delay the time in which those valued sources are used to create meaningful action.
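The point-of-origin gap described above can be illustrated with a minimal sketch. The field names, rules, and records here are invented for illustration: hypothetical social-media records are checked against the kind of completeness standards a governed system would assume, and each one fails.

```python
# Illustrative only: hypothetical social-media records checked against
# the completeness rules a conventionally governed system would assume.
REQUIRED_FIELDS = ("first_name", "last_name", "address")

def governance_gaps(record):
    """Return the conventional-quality fields this record is missing."""
    return [f for f in REQUIRED_FIELDS if not record.get(f, "").strip()]

social_records = [
    {"handle": "@bob_the_builder", "first_name": "Bob"},   # no surname, no address
    {"handle": "@jsmith99", "first_name": "", "last_name": "Smith"},
    {"handle": "@acme_fan"},                               # nickname only
]

for rec in social_records:
    print(rec["handle"], "missing:", governance_gaps(rec))
```

None of these records would pass the standards an internal system imposes, yet none of them can be rejected at the source, which is precisely the governance problem Martins describes.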
In fact, the means of utilizing Big Data in a timely manner that generates business value on an ongoing basis, without negatively impacting an organization, may well hinge on delivering dependable Data Quality through familiar platforms and architectural conventions (such as Cloud Computing) for this technology. As Martins noted:
“Customers still want the quality and they want to be able to leverage it. In order to do that our offering provides the capability for applying the same kind of Data Quality principles that we’ve applied to traditional data stores and records and now extending it out to Big Data.”
One of the critical elements of implementing Big Data Quality is accounting for a variety of data sources, many of which lie outside the enterprise’s firewalls. In particular, it may be difficult for organizations to consolidate those sources effectively enough to provide a single view of a customer, their habits, and their points of interaction with the organization. Martins stated:
“The challenge is for our clients in these environments; their systems assume data in a certain format, in a certain structure, with a certain level of completeness. The reality is it’s a rare instance in which you get those requirements met.”
Tools for providing Big Data Quality include offerings designed to integrate with Hadoop (such as Trillium Big Data) and those that provision quality measures through the Cloud as a service (Trillium Cloud). They implement a Data Quality layer that screens any number of sources against quality standards before their data is ingested and aggregated with a company’s core data. The result is that in addition to aggregating data from disparate systems such as social media platforms, ecommerce applications, and marketing automation tools, organizations are able to do so while accounting for:
- Incomplete or incorrect fields
- Name or address variations based on profiles in different online communities
- Information appendages
The result is an aggregated view of a customer’s data with a trustworthy level of quality. As Martins explained:
“We provide that Data Quality glue to bring all of those things together. We’re not the integration glue ourselves, but we make it workable. As you import data from multiple sources, we become the quality mechanism by which you ensure you’re not introducing more errors.”
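A minimal sketch of this kind of “quality glue,” assuming an invented nickname mapping and record fields (this is illustrative only, not Trillium’s actual implementation): normalize name variants drawn from different online profiles, then merge records that resolve to the same customer into a single view.

```python
# A hypothetical pre-ingestion quality layer: normalize name variants
# from different sources, then merge records for the same customer.
NICKNAMES = {"bob": "robert", "rob": "robert", "liz": "elizabeth"}  # assumed mapping

def normalize_name(name):
    """Lowercase, strip, and expand common nicknames in a 'First Last' string."""
    first, _, last = name.strip().lower().partition(" ")
    return f"{NICKNAMES.get(first, first)} {last}".strip()

def merge_sources(*sources):
    """Aggregate records keyed on normalized name; last non-empty field wins."""
    merged = {}
    for source in sources:
        for rec in source:
            key = normalize_name(rec["name"])
            out = merged.setdefault(key, {})
            out.update({k: v for k, v in rec.items() if v})  # skip empty fields
    return merged

social = [{"name": "Bob Smith", "handle": "@bob_s", "email": ""}]
ecommerce = [{"name": "Robert Smith", "email": "r.smith@example.com"}]
view = merge_sources(social, ecommerce)
print(view["robert smith"])
```

Here “Bob Smith” and “Robert Smith” collapse into one customer record. A production quality layer would of course apply far richer matching and survivorship rules than last-write-wins, but the principle is the same: the quality mechanism sits between the sources and the aggregated view so that integration does not introduce more errors.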
Martins also indicated that one of the drivers for Trillium Big Data and the need to exercise Data Quality for Big Data initiatives (particularly those that utilize Hadoop) is the contemporary relevance of Data Lakes. The long-term viability of these repositories comes into considerable question when they are utilized from an enterprise-wide perspective. However, implementing Data Quality controls at a systemic level for options such as Hadoop, and for other NoSQL offerings used as Data Lakes, effectively helps mitigate their primary detriment (their native lack of governance and quality support) while augmenting their fundamental strengths of common access and speed to action.
Quality measures for Data Lakes extend beyond facilitating one of the more recent trends across the Data Management sphere and increasing the viability of Big Data. According to Martins, they strike at the core of another fundamental issue of Big Data: privacy, and the means of facilitating a positive customer experience in a world in which organizations must balance their need to gather data with customers’ willingness to provide it. In this respect, quality measures for Big Data provide a critical compromise that can help ensure organizations accomplish Data Governance in a way that is aligned with their organizational values. Martins observed:
“If you impose tremendous rigor, which is what your systems want, you will drive a lot of your prospects away trying to get that information, because they just want to interact with you in a comfortable manner. How do you marry your internal goals for governance in terms of customer interaction or Big Data with someone who’s not in your span of control: a prospect, a customer, a business partner? How do you marry what they want with what you want? So [with Data Lakes] maybe there’s no governance initially, but you still want to be actionable with your data.”
Data Quality as a Service
Another aspect of Big Data and Big Data Governance is the deployment of Cloud-based models and Service-Oriented Architecture (SOA). Cloud options are oftentimes desirable for various facets of Big Data initiatives, from a bevy of analytics services to Platform as a Service and Infrastructure as a Service offerings. Data Quality as a Service options have advantages over on-premise variations of Data Quality solutions in that they require less infrastructure and are priced more economically. Martins noted that they also facilitate a relatively rapid time to value, since organizations can frequently begin using them much more quickly than on-premise versions, which often require significant planning for infrastructure needs and can stall when organizations need to scale up:
“By deployment via the Cloud we are effectively removing all of the technology requirements and, I’ll use the term ‘below the water line’, all of the infrastructure in terms of servers, hardware and software infrastructure, operating systems, middleware—all of that ‘stuff’ that you need to run an enterprise class solution such as we offer.”
Quality Big Data
Big Data Quality controls can play a substantial role in implementing Big Data Governance and mitigating some of the complications for governance that Big Data inherently has. They can make sense of and effect standardizations for social media and other unconventional data sources, prevent Data Lakes from becoming swamps, and ensure Data Quality over enormous sets of data while adding a critical governance layer to the data integration process. Their ultimate advantage, of course, is their proclivity for reinforcing governance with a technology that, in many ways, naturally eschews it, while creating greater trust, comfort, and experiences for customers.