Make it as easy to add and connect new data sources into the enterprise analytics infrastructure as it is to add a new web site to the modern web. That’s the goal of next-gen data curation company Tamr, a startup born from an MIT research project, which aims to bring together many tabular data sources in a scalable and repeatable way.
Just like Google does all the work to find and connect web sites hosting the information that users want, “we want to do the same with tabular data sources inside the enterprise,” says Tamr co-founder and CEO Andy Palmer. “Tamr provides systems of reference. If you are looking for attributes to add to an analysis or want data to support something, you have this reference place to go in the enterprise with a catalogue of all the data that exists across the company.”
So often businesses want to use analytics to address hard questions, but can’t do so successfully unless they are integrating lots of disparate data sources and creating a referential catalog. With Tamr, Palmer says, they can ingest data sources very quickly into a semantic triple store, make them available in real time, and connect them using machine learning to map attributes and match records, providing a unified view of a given entity that can then be consumed by various business intelligence and analytics tools. To be usable, he points out, data has to be “very, very thoroughly connected into everything else for there to be context and reference for how it can be consumed and whether it is reliable.”
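To make the attribute-mapping idea concrete, here is a minimal sketch in Python. It is not Tamr’s method; real systems learn from much richer signals (value distributions, data types, usage context), but even plain string similarity over column names shows what “mapping attributes” between independently built tables means. All column names below are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical column names from two independently built source tables.
source_a = ["cust_name", "phone_num", "zip"]
source_b = ["customer_name", "telephone", "postal_code", "zip_code"]

def attribute_map(cols_a, cols_b, threshold=0.5):
    """Propose attribute mappings by string similarity.

    A toy stand-in for machine-learned attribute mapping: for each
    column in the first table, find the most similar column name in
    the second and keep the pair if it clears a threshold.
    """
    proposals = []
    for a in cols_a:
        best = max(cols_b, key=lambda b: SequenceMatcher(None, a, b).ratio())
        score = SequenceMatcher(None, a, best).ratio()
        if score >= threshold:
            proposals.append((a, best, round(score, 2)))
    return proposals

for a, b, score in attribute_map(source_a, source_b):
    print(f"{a} -> {b} ({score})")
```

The threshold is where the probabilistic framing enters: pairs well above it can be trusted, pairs near it are exactly the cases worth escalating to a human, as described below.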
Helping drive that is the system’s ability to generate specific questions to be asked of data experts – often the owners of the data sources – that are designed to improve confidence levels around the attributes it maps and the records it matches. Data stewards are then called on to decide which maps and matches to recognize as authoritative.
“That probabilistic authoritative attributes-mapping and records-matching is at the core of what makes the Tamr system unique,” says Palmer. “We don’t assume that a machine can do all or that humans are required to do everything. It’s an intimate combination of man and machine to curate large quantities of data sources at a very large scale.” In other words, as he terms it, the work is “machine driven and human guided.”
Using big triple stores and machine learning to address the enterprise data unification issue provides a very compelling way for businesses to solve their problems, he says, whether the goal is a 360-degree view of suppliers, customers, products, and so on. But Tamr doesn’t get into the weeds of talking about the semantic principles and design patterns behind its solution with clients; rather, it focuses on the end results they care about. One large customer, for example, uses the technology to integrate purchasing catalogs from many different divisions in its enterprise to optimize spending, making sure it gets the best price for all products across the entire company.
To that end, “it has to aggregate and rationalize thousands of independent purchasing catalogs from various divisions,” he says. And the company has to be able to ask questions about its suppliers in a relatively simple way across disparate systems in real time for the answers to be effective in saving it money. Inviting in a consultant to run SQL queries on each system and having her come up with the answers six months later, after all, doesn’t help much, given all the changes that could take place among suppliers in that timeframe.
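Once catalog records from different divisions have been matched to a shared product identity, the spend-optimization question itself reduces to a simple aggregation. The sketch below assumes that matching has already happened (the hard part the article describes); division names, product keys, and prices are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical division catalogs, already unified on a shared product key.
catalogs = {
    "div_north": [("widget-a", 12.50), ("gasket-9", 3.10)],
    "div_south": [("widget-a", 11.80), ("gasket-9", 3.45)],
    "div_west":  [("widget-a", 12.10)],
}

def best_prices(catalogs):
    """Best unit price per matched product across all division catalogs."""
    best = defaultdict(lambda: (None, float("inf")))
    for division, items in catalogs.items():
        for product, price in items:
            if price < best[product][1]:
                best[product] = (division, price)
    return dict(best)

for product, (division, price) in best_prices(catalogs).items():
    print(f"{product}: {price} from {division}")
```

The aggregation is trivial precisely because the expensive matching work has been done upstream; without a unified product key, the same query would require a consultant stitching together per-system SQL, as the article notes.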
Tamr shipped the first commercial release of the product earlier this year, and it’s actively working on new features and functions that customers are asking for, he says. Customers to date include Novartis and Thomson Reuters. “Adoption,” Palmer says, “has been pretty dramatic.”