Data Quality in the Data Lake


It’s been a few months since data migration and integration vendor Syncsort completed its acquisition of Data Quality solutions vendor Trillium Software. The combination of the two vendors is already proving a good fit, mirroring where the industry is going, according to Harald Smith, Director of Product Management at Trillium.

As he explains it, Syncsort is in the business of liberating data, taking mainframe and legacy data to environments like Data Lakes where it is consumable and can be harnessed for Big Data Analytics operations that can push business insights forward. Trillium had already enabled its Trillium Software System (TSS) Data Quality capabilities to run on Hadoop, bringing to the picture the ability for those insights to be trusted because they are built on a solid foundation.

Data profiling is a core strength of the Trillium solution: it analyzes data contents to compile summary information about records, which helps in understanding issues like the accuracy, consistency and completeness of data within a system.
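To make the idea of profiling concrete, here is a minimal sketch in plain Python. It is not Trillium's implementation or API. The function name `profile_column`, the statistics chosen, and the sample records are all assumptions invented for illustration.

```python
from collections import Counter

def profile_column(records, column):
    """Compile summary statistics for one column of a list of dict records."""
    values = [rec.get(column) for rec in records]
    # Treat None and empty or whitespace-only strings as missing.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    total = len(values)
    return {
        "count": total,
        "missing": total - len(present),
        "completeness": len(present) / total if total else 0.0,
        "distinct": len(set(present)),
        "top_values": Counter(present).most_common(3),
    }

# Hypothetical sample records, invented for illustration.
records = [
    {"id": 1, "state": "NY", "email": "a@example.com"},
    {"id": 2, "state": "ny", "email": ""},
    {"id": 3, "state": "NJ", "email": None},
    {"id": 4, "state": "NY", "email": "d@example.com"},
]

state_profile = profile_column(records, "state")
email_profile = profile_column(records, "email")
```

In this toy data the `email` profile exposes a completeness gap (half the values are missing), while the `state` profile counts "NY" and "ny" as distinct values, which hints at a consistency issue.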

“There are a lot of challenges with a lot of the data we work with,” says Smith. Today, Data Quality challenges manifest in new ways in large data lake environments, where companies want to use known and unknown sources of data with highly varied formats and disparate meanings and uses, and questions of trust emerge around original data and around data that winds up getting acted on. In addition to the need to trust the accuracy of the data for analytics, he says, there are also compliance mandates that the diverse data sources and movement of data can complicate.

The Data Lake Dilemma

Trillium is exploring this issue and how to evaluate Data Quality to make sure that the data that ultimately is used is relevant and trustworthy. It starts from the premise that you can’t establish a level of Data Quality upfront going into the Data Lake environment. This is where Syncsort’s high-volume ingestion capabilities come into play: its expanded range of data connectors extends Trillium’s reach, efficiently taking legacy data in bulk from all sources and bringing it into the Data Lake.

“Now we are looking to connect that into our profiling capabilities that lets us say, ‘Here is the core content you have,’” Smith says. It’s not about restricting what comes in, but having better context about what comes in so that you can make the best business decisions when acting upon it. “Not only is quality an issue going into the Data Lake, and within the Data Lake itself, but it is also a concern as it moves out of the Data Lake to other platforms, especially for analytics,” he says.

“You want rich content coming in but don’t want to do it in a way that is prescriptive upfront,” Smith says. It doesn’t work for the people who are dealing with Big Data in Data Lakes, like Data Scientists puzzling over active business problems, to have pieces of information excluded out of hand. Those pieces might contain important anomalies that must be accounted for in order to accurately answer queries. “You want to get to where Data Scientists can begin to understand the content, apply business rules to it, explore it, and validate or potentially invalidate hypotheses,” he says. “And then be able to put things in place that are meaningful and have value and establish a level of trust.”

Data Quality in Transition

Smith says it’s interesting to consider that the core dimensions of understanding Data Quality – its completeness, validity and timeliness – are changing, too. For instance, social media is part and parcel of the Big Data/Data Lake package. But how do you know if something like a Twitter feed pouring into your Data Lake is complete and valid enough to trust as you’re conducting Big Data Analysis?

He references Hurricane Sandy, where 20 million tweets poured in from Manhattan, which wasn’t the hardest hit area by any measure. People in the places that did suffer more were in the middle of the event, so they were less likely to have electricity or cell signals and less likely to be able to tweet. “You have to start thinking of whether you have the right coverage in your data,” he says, if you hope to come to accurate conclusions. Does the Twitter, sensor or other feed flowing into your Data Lake cover everything you must address, or must other data be brought in for context?
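One way to reason about coverage is to compare each region's share of the incoming records with the share you would expect from an independent reference, such as the affected population. The sketch below assumes exactly that. The function `coverage_gaps`, the `tolerance` threshold, the region names and all numbers are hypothetical, and none of them come from Trillium or from real Hurricane Sandy data.

```python
def coverage_gaps(observed_counts, expected_share, tolerance=0.5):
    """Return regions whose share of observed records falls below
    tolerance * their expected share (e.g. share of affected population)."""
    total = sum(observed_counts.values())
    gaps = {}
    for region, expected in expected_share.items():
        observed = observed_counts.get(region, 0) / total if total else 0.0
        if observed < tolerance * expected:
            gaps[region] = {"observed": round(observed, 3), "expected": expected}
    return gaps

# Hypothetical numbers, invented for illustration only.
tweets_by_region = {"Manhattan": 750, "Coastal area A": 50, "Coastal area B": 200}
expected_share = {"Manhattan": 0.4, "Coastal area A": 0.3, "Coastal area B": 0.3}

gaps = coverage_gaps(tweets_by_region, expected_share)
```

With these made-up figures, only "Coastal area A" is flagged: it supplies 5% of the records against an expected 30%, which is the kind of signal that suggests bringing in other data for context.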

Trillium’s data profiling helps here by giving the individuals who are most likely to understand such issues, like business analysts, the ability to use business rules to ensure that business information is fit for purpose and relevant to the problem they are trying to solve or the issue they are trying to assess. Trillium also sees a need to connect core Data Quality information with Data Governance capabilities, which has led to its partnerships with several Data Governance vendors. “We can feed our profiling or rule-based information in so that if you look across the Metadata landscape or the business semantic layer, you can get some insight into the quality measure as well,” he says.

In fact, over the last couple of years the company has focused on revamping the user experience for people working with TS Discovery’s profiling capabilities so that they can be readily used by business analysts. They can look at the data at multiple levels and across different attributes to better understand patterns, and build business rules to construct what gets generated out of it.

“It’s really geared around pushing this out to a broader line of business community that wants to be able to take advantage of the power of data profiling and making assessments about data and rules around data sources,” he says.

Trillium’s Next Steps

Trillium continues to extend its Data Quality capabilities through integration with Syncsort’s product line. Most recently, it has been focusing on Trillium Precise, a new data enrichment solution built on top of its Data Quality foundation. It helps organizations validate, verify, and enrich data about the primary communications channels, such as phone, email, and IP address, that they use to interact with their customers, in support of a 360-degree customer view. “It’s really one of those things we find resonating with customers,” says Smith.

In addition, Smith says,

“We will soon be announcing new integration between Syncsort’s DMX-h data integration and Trillium’s data profiling and Data Quality software based on some common customer use cases that address the challenges we’ve discussed around the Data Lake.”

