Taming Data Integration

by Jelani Harper

Successfully integrating disparate data sources and forms of data in a way that provides a unified view of the information relevant to a particular application requires several technologies, including:

  • Semantics: By reducing data to its simplest form with semantic technology, organizations can cull, aggregate and compare data from different sources to determine its relevancy via analytics.
  • Analytics: Once data has been reduced to its simplest expression via semantic technology, analytics can determine which particular data is germane to a specific application regardless of data source or form.
  • Machine Learning: Machine Learning technologies can refine the deployment of data for specific applications by applying rules and mandates from prior uses to future ones, which reduces latency and simplifies applications.

Tamr has combined these technologies and others to create a sophisticated data integration product that can reduce complexity for a wide range of data users, including Data Scientists, business analysts, and lay end users, by:

  • Facilitating a Unified View of Sources: With Tamr, organizations can quickly view all of their data relevant to a particular task in a way similar to Data Virtualization. Actually accessing the data and utilizing it requires going to those data sources, or building an app to do so.
  • Adding New Sources Easily: The product greatly decreases the lengthy periods of time required by IT to configure data sources and elements for certain applications by enabling organizations to add sources without reconfiguring previous integration efforts.
  • Data Governance: There are several elements of Tamr that are conducive to effective governance, from implementing various data quality measures to involving responsible domain experts in decisions pertaining to certain data types.

Integrated View

The integration aspect of Tamr enables organizations to readily discern which data from their myriad data sources is useful for a specific job. Together with the capabilities listed above, this raises the possibility that Tamr could become the de facto tool for viewing data and deciding which of it to use for a particular application, in much the same way that a certain search engine is the default tool for navigating the web. According to Tamr CEO Andy Palmer:

“The Tamr system is essentially a system of reference, rather than a system of record. So, very much like Google organizes all of the world’s websites and is the place you go to find websites, Google doesn’t actually host the web sites themselves. Tamr functions similarly with all the different data sources that are out there in the enterprise where we index them, we organize them for you, but when it comes time to actually run analytics, people are probably going to run them directly against the data sources.”
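The "system of reference" idea can be illustrated with a small sketch. The class and field names below (`SourceEntry`, `ReferenceCatalog`, `location`) are hypothetical and are not Tamr's API; the sketch only assumes that a catalog stores descriptions of sources and pointers to them, while the data itself stays where it is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceEntry:
    """Describes a data source; holds no rows from it (hypothetical type)."""
    name: str
    location: str       # where the data actually lives
    attributes: tuple   # attribute names found in that source


class ReferenceCatalog:
    """Indexes sources by attribute name and answers with pointers only."""

    def __init__(self):
        self._by_attribute = {}

    def register(self, entry):
        # Index the source under each attribute it exposes.
        for attr in entry.attributes:
            self._by_attribute.setdefault(attr.lower(), []).append(entry)

    def find(self, attribute):
        # Return locations, not data: analytics run against the sources.
        return [e.location for e in self._by_attribute.get(attribute.lower(), [])]


catalog = ReferenceCatalog()
catalog.register(SourceEntry("erp", "db://erp/parts", ("part_number", "price")))
catalog.register(SourceEntry("crm", "db://crm/quotes", ("customer", "Price")))
locations = catalog.find("price")
```

Here `locations` lists both sources, and a caller would then query those sources directly, which mirrors the search engine comparison in the quote.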

Data Curation

As useful as the ability to integrate disparate data sources is, Tamr also provides a multitude of other uses that transcend integration and affect governance and other aspects of Data Management. Virtualization techniques or the trend to utilize Hadoop as an integration hub can also provide a means of integration. A more profound value associated with Tamr exists in its ability to actually groom and curate data. As such, it maintains its utility when used in conjunction with Hadoop or virtualization technologies.

In addition to processing analytics to determine which data is relevant to a specific use case, Tamr also performs:

  • Data cleansing
  • Data deduplication
  • Data transformation

The results are rendered in a format that adheres to Metadata standards. Its Machine Learning algorithms are largely responsible for these processes, as well as for establishing precedents regarding data attributes and conventions, and using them to inform future processes involving similar data and conventions.
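As a rough illustration of those three curation steps, the sketch below chains cleansing, transformation and deduplication over a few records. It deliberately substitutes fixed rules for the Machine Learning the article describes, and the field names and the `FIELD_MAP` mapping are assumptions made up for the example.

```python
import re

# Hypothetical mapping from source-specific field names to one shared schema.
FIELD_MAP = {
    "part_no": "part_number",
    "PartNumber": "part_number",
    "desc": "description",
    "Description": "description",
}


def cleanse(record):
    """Trim whitespace and turn empty strings into None."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip() or None
        cleaned[key] = value
    return cleaned


def transform(record):
    """Rename fields so every source uses the same attribute names."""
    return {FIELD_MAP.get(key, key): value for key, value in record.items()}


def dedup_key(record):
    """Normalize the part number so 'AB-100' and 'ab100' compare equal."""
    part_number = record.get("part_number") or ""
    return re.sub(r"[^a-z0-9]", "", part_number.lower())


def curate(records):
    """Cleanse, transform, then keep the first record seen per key."""
    unique = {}
    for raw in records:
        record = transform(cleanse(raw))
        unique.setdefault(dedup_key(record), record)
    return list(unique.values())


curated = curate([
    {"part_no": " AB-100 ", "desc": "Bolt"},
    {"PartNumber": "ab100", "Description": "bolt "},
    {"part_no": "CD-200", "desc": ""},
])
```

The second input record is treated as a duplicate of the first, so `curated` holds two records. A learned model would replace the hand-written `dedup_key` rule with a probabilistic match.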

Data Governance

Certain facets of the data curation process (such as data cleansing and reducing the incidence of data duplication) are hallmarks of Data Governance. Tamr’s relationship with governance is two-fold. On the one hand, the product enables silos and data marts to continue to exist without the need to dump all of their data in one place accessible to all. On the other, it greatly improves the efficacy of those silos and makes them accessible to integrated views of all of an organization’s data, which comes close to making them no longer silos.

The product also reinforces governance by functioning as a starting point with which to view data to see if its formatting and metadata are adhering to governance standards. It also involves Data Stewards and domain experts in crucial questions about data via capabilities associated with its Machine Learning (discussed below). Palmer remarked:

“We’re really seeing a lot of our customers, as they start to use Tamr and get into this sort of bottom level or probabilistic approach to curation, very quickly view the resource that Tamr provides them—which is a view across all their data—as a very powerful method to begin to govern their data more effectively. It rationalizes who has access to what, what sources are coming from where, what’s appropriate and what isn’t.”

Human Tempered Machine Learning

Another facet of Tamr’s Machine Learning technology that distinguishes it from other integration platforms is that it incorporates a human element when setting precedents for, and making changes to, data based on past experiences. When its analytics process is determining which data is relevant for a specific need, it will attempt to refer most ambiguities or questions to the individuals who have been identified as the owners of the data. The product’s expert sourcing system will automatically generate an email asking the relevant owner to resolve the issue. Going forward, the Machine Learning capabilities will apply that response to future issues involving that data or similar types of ambiguities. According to Tamr strategy and marketing lead Nidhi Aggarwal:

“We’re sending out very few questions to the experts, and with that we are tuning the system so that the system can connect the data appropriately. It’s not that the data expert has to answer every single question automatically. He actually has to answer five questions that help the system resolve 200 or 2,000 records.”
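The way a handful of expert answers can resolve thousands of records can be sketched as follows. This is not Tamr's implementation: the thresholds, the `ask_expert` callback (standing in for the generated email) and the confidence score passed in by the caller are all assumptions, and simple memoization stands in for the learning step.

```python
class ExpertSourcedMatcher:
    """Decides whether two attributes match, asking a human only when unsure."""

    def __init__(self, ask_expert, low=0.3, high=0.8):
        self.ask_expert = ask_expert  # hypothetical stand-in for emailing the data owner
        self.low = low                # at or below this score: confident non-match
        self.high = high              # at or above this score: confident match
        self.answers = {}             # remembered expert decisions
        self.questions_asked = 0

    def same_attribute(self, source_field, target_field, score):
        if score >= self.high:
            return True
        if score <= self.low:
            return False
        # Ambiguous case: ask once, then reuse the answer for later records.
        key = (source_field, target_field)
        if key not in self.answers:
            self.questions_asked += 1
            self.answers[key] = self.ask_expert(source_field, target_field)
        return self.answers[key]


def fake_expert(source_field, target_field):
    # Pretend the data owner confirms that the two fields mean the same thing.
    return True


matcher = ExpertSourcedMatcher(fake_expert)
# 2,000 records share the same ambiguous field name, scored 0.5 by the caller.
results = [matcher.same_attribute("pn", "part_number", 0.5) for _ in range(2000)]
```

All 2,000 decisions come back resolved while `matcher.questions_asked` stays at 1, which is the effect Aggarwal describes: the expert answers a question about a pattern, not about each record.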

Actual Deployment

One of the most convincing use cases illustrating the potential of Tamr involves a supply chain management customer that had vast amounts of data across different lines of business. The organization wanted to build an app that would reveal the best price that it was getting for a particular product across the enterprise. However, the data that contained these answers was only available in silos. Utilizing Tamr, the organization was able to get a unified view of all of its product data, which was no easy feat since such data was categorized in various ways according to different Metadata and product number standards, or the lack thereof. Aggarwal noted:

“This is where Tamr was really useful in connecting all of that data by using Machine Learning and using the expert [sourcing system]. And now that you’ve actually unified the data, you can actually build that app on top of it. You can type in your part number or you can type in your part as you know it and the Tamr system will know that this record is a cluster of parts that is actually the same, and that these are the prices I should actually display for this particular part number. That is the kind of analytics that can pull data out of a unified system that Tamr provides.”
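A minimal sketch of the best-price lookup described in the quote, assuming the clustering has already been done and is available as a plain mapping from part number to cluster id. The mapping `cluster_of`, the record fields and the reading of "best price" as the lowest price are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical output of the unification step: variant part numbers that
# refer to the same physical part share one cluster id.
cluster_of = {
    "AB-100": "c1",
    "ab100": "c1",
    "0AB100": "c1",
    "CD-200": "c2",
}

records = [
    {"part_number": "AB-100", "business_unit": "North America", "price": 4.10},
    {"part_number": "ab100", "business_unit": "Europe", "price": 3.75},
    {"part_number": "0AB100", "business_unit": "Asia", "price": 3.90},
    {"part_number": "CD-200", "business_unit": "Europe", "price": 9.00},
]


def build_price_index(records, cluster_of):
    """Group (price, business unit) pairs by cluster rather than by raw part number."""
    index = defaultdict(list)
    for record in records:
        cluster = cluster_of[record["part_number"]]
        index[cluster].append((record["price"], record["business_unit"]))
    return index


def best_price(part_number, cluster_of, index):
    """Return the lowest (price, business unit) pair for the part's cluster."""
    return min(index[cluster_of[part_number]])


index = build_price_index(records, cluster_of)
best = best_price("0AB100", cluster_of, index)
```

Looking up `"0AB100"` returns the Europe price of 3.75, even though that price was recorded under a differently formatted part number in another line of business.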

Going Forward

There are many different aspects of the Tamr solution that are noteworthy. Its ability to provide a unified view across data sources and to run analytics to determine which data is relevant to a particular task is invaluable; its potential to assist with governance is something the company hopes to continue exploring in the future. The ease with which it enables organizations to quickly add new data sources for a particular application is also notable. All of these factors, together with the underlying Machine Learning algorithms that can improve performance by making educated inferences based upon previous changes, make this product much more than just an integration tool.

