Automated Data Catalogs Allow Rethinking of Data Policies

In Data Catalogs we trust. Or at least that should be the case. In a recent global survey on data-driven decision making in business, close to 60 percent of respondents say their companies base at least half of their regular business decisions on gut feel or experience rather than data and information. Respondents cited problems with availability of the necessary information as well as the quality of the data as top barriers to becoming a data-driven organization.

“Managers don’t understand how the ‘analytics sausage’ was made,” commented Stephanie McReynolds, V.P. of Marketing at Data Catalog vendor Alation in a recent DATAVERSITY® interview. Business users now have access to easy-to-use tools to visualize and analyze data, but it feeds into a “wild west of Self-Service Reporting” where the output produced is not necessarily well-governed, she noted.

Two business analysts on the same team can arrive at two different answers to the same question because they bring different perspectives, assumptions, and understandings of what the data means; that leads them to conduct each step of the analytics process in their own way, she explained. But unless leaders can understand the work each analyst did to get a result, it’s difficult for them to know which analysis to trust and base decisions on.

Rise of the Automated Data Catalog

Managers can’t spend time documenting every step of the analytics process. That’s where the idea of the automated Data Catalog takes shape. Machine Learning and deep parsing of query logs automate the collection of data (including technical metadata, user permissions, and business descriptions) in one place, and take advantage of connections to Data Visualization tools, databases, and HDFS to index data by source. Clean definitions about data and the nuances of particular data sets are the result of business glossaries that are linked to data dictionaries, which provide a technical representation of data.
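To make the idea of mining query logs concrete, here is a minimal Python sketch that counts how many logged queries reference each table. It is an illustration only, not Alation's implementation: the log lines, the `table_usage` function, and the simplified regular expression are all assumptions made for this example, and a real catalog would need a proper SQL parser to handle subqueries, CTEs, and quoted identifiers.

```python
import re
from collections import Counter

# Hypothetical query-log entries; a real catalog would pull these from
# the database's own query history rather than a hard-coded list.
QUERY_LOG = [
    "SELECT c.id, o.total FROM customers c JOIN orders o ON o.customer_id = c.id",
    "SELECT COUNT(*) FROM orders WHERE total > 100",
    "select name from customers where region = 'west'",
]

# Simplified assumption: a table name is whatever identifier follows
# FROM or JOIN. Subqueries, CTEs, and quoted names are not handled.
TABLE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def table_usage(queries):
    """Count how many queries reference each table."""
    usage = Counter()
    for query in queries:
        # Use a set so a table named twice in one query is counted once.
        tables = {name.lower() for name in TABLE_PATTERN.findall(query)}
        usage.update(tables)
    return usage


usage = table_usage(QUERY_LOG)
for table, count in usage.most_common():
    print(f"{table}: referenced by {count} queries")
```

Usage counts like these are one kind of metadata that can be gathered without anyone documenting their work by hand, which is the point of automating the catalog.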

With this in place, humans only have to worry about handling 20 percent of the Data Curation work, not 100 percent, while continually improving their understanding of data.

“Documenting and sharing work becomes worthwhile when it’s just a small investment of time,” McReynolds said. “There’s more literacy and alignment throughout the organization.”

Data Catalogs are most useful in helping tag and track information through the cycle of usage, she said. About 100 production customers are using the Alation Data Catalog, including large brand names like Albertsons, Tesco, and GE.

“Decisions around data in the Global 1000 is the bigger pain point,” McReynolds said. Data Catalogs, she notes, “are forcing rethinking of data policies and how complicated some analytics processes have become.”

Getting Proactive with Data

In the case of Alation’s automated Data Catalog, applications atop it provide proactive recommendations to data consumers of what data to use in the language of the business, she says, and help in building queries around data sets with recommendations for joins, filters, and so on. “We’ve done a lot of work in our information stewardship app that’s part of Data Catalog,” she said, including the introduction of the trust check feature. A request about which data set to use triggers a recommendation for the data consumer.

“As you write a query, a prompt comes back and identifies the data set you want to use as a trusted or certified data set, so there’s more transparency for the user,” McReynolds said. Data Stewards and Data Scientists help in certifying data sets and methodologies as the gold standard for the organization. Of course, users are free to choose any data set they want to work with, but that could present a risk to their Analytics conclusions.
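The kind of check McReynolds describes can be pictured with a short Python sketch. This is not Alation's trust check feature or its API: the `CERTIFIED_TABLES` registry, the table names, and the `trust_check` and `format_prompt` functions are hypothetical, and the table extraction relies on a deliberately simple regular expression.

```python
import re

# Hypothetical registry of data sets that stewards have certified.
CERTIFIED_TABLES = {"sales.orders_certified", "crm.customers_gold"}

# Simplified assumption: table names follow FROM or JOIN.
TABLE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def trust_check(query, certified=CERTIFIED_TABLES):
    """Map each table referenced in the query to True if it is certified."""
    tables = {name.lower() for name in TABLE_PATTERN.findall(query)}
    return {table: table in certified for table in sorted(tables)}


def format_prompt(report):
    """Turn a trust report into the message shown to the query author."""
    lines = []
    for table, trusted in report.items():
        status = "certified" if trusted else "not certified; conclusions may carry more risk"
        lines.append(f"{table}: {status}")
    return "\n".join(lines)


report = trust_check(
    "SELECT * FROM crm.customers_gold c JOIN staging.raw_orders r ON r.cid = c.id"
)
print(format_prompt(report))
```

The sketch only reports status and never blocks the query, mirroring the point that users remain free to choose any data set.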

“We want to enable stewards with a way to propagate the idea of working through the system,” she said. “This highlights the value of stewardship and makes them a collaborator in the analysis being done, even if it’s in an automated fashion.”

Data is Money

Alation’s Data Catalog has the ability to make an economic evaluation of data, in its raw state or in a final report. This involved taking some of the methodologies around infonomics and including them in the product.

With this capability, “the CDO can inventory all the data assets and make that a living catalog to understand the behaviors of data that is used and put an economic figure on it and how to manage that data,” said McReynolds. “If you keep a stable data store and processing layer, you get more insight into the impact of data and you automate the audit trail.”

Infonomics can be used to do a cost-benefit analysis of a table of data that describes a customer. Understanding how often it is used by analysts or Data Scientists in Analytics — and if there are business decisions being made from the output of that analysis — can help determine the economic impact of that data on the business.
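A rough way to picture such a cost-benefit analysis is the Python sketch below. It is not a formal infonomics model and not how Alation computes value: the `ProjectUse` record, the `data_share` attribution fraction, and every figure are assumptions invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProjectUse:
    """One analytics project that drew on the customer table (hypothetical)."""
    name: str
    business_value: float  # estimated value of decisions the project informed
    data_share: float      # assumed fraction of that value credited to this table


def table_net_value(uses, annual_cost):
    """Benefit credited to the table across projects, minus its annual cost."""
    benefit = sum(use.business_value * use.data_share for use in uses)
    return benefit - annual_cost


# Illustrative numbers only; none of them come from the article.
uses = [
    ProjectUse("coupon personalization", 100_000.0, 0.25),
    ProjectUse("churn analysis", 40_000.0, 0.5),
]
net = table_net_value(uses, annual_cost=5_000.0)
print(f"Net annual value credited to the table: {net:,.0f}")
```

The hard part in practice is the attribution fraction, which is why tracking who uses the data and what decisions follow from it matters so much.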

As an example of the real-world economic value of data, the grocery company Albertsons’ customer insights team used Alation to run an analytics project around driving coupon redemptions. To determine what customer segmentation data to use for their analysis, the team typed a search term into Alation, which helped find a data set to support its analysis.

“The output was an improvement in the personalization of the coupon redemption program. That increased 300 percent,” McReynolds said. “That is the business impact value of that data for just one project.”

Add that up across ten to thirty projects a year, and the economic impact of that data set for the company could be significant. “You can get a dollar figure to the raw data set that’s being used,” she said.

There are other ways of applying infonomics to Data Management that can benefit the organization. For instance, data decisions can be based on actual observations and values instead of accepted standards. Rather than following a general practice of archiving data on a 10-year timeframe, organizations can track how many analysts use the data, what algorithms it contributes to, and what business outcomes it is involved in, and determine from there whether it is worthwhile to maintain or archive it. That decision can also weigh the data’s value against the financial risks it could present to the organization, as in the case of breaches of customer-sensitive information.
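An observation-based retention rule of this kind might look like the following Python sketch. The `DataSetStats` fields, the thresholds, and the three outcomes are assumptions chosen for illustration, not a policy described by McReynolds or implemented in any product.

```python
from dataclasses import dataclass


@dataclass
class DataSetStats:
    """Observed facts about one data set (all fields hypothetical)."""
    name: str
    active_analysts: int     # distinct analysts who queried it in the review window
    linked_outcomes: int     # algorithms or business outcomes that depend on it
    estimated_value: float   # economic value credited to the data set
    breach_risk_cost: float  # expected financial exposure if it were breached


def retention_decision(stats, min_analysts=1):
    """Return 'archive', 'review', or 'maintain' from observed usage and risk."""
    in_use = stats.active_analysts >= min_analysts or stats.linked_outcomes > 0
    if not in_use:
        return "archive"
    # Still in use, but the exposure outweighs the value: flag for a human decision.
    if stats.breach_risk_cost > stats.estimated_value:
        return "review"
    return "maintain"


stale = DataSetStats("old_campaigns", 0, 0, 500.0, 2_000.0)
risky = DataSetStats("customer_pii", 4, 2, 10_000.0, 50_000.0)
core = DataSetStats("orders", 12, 5, 80_000.0, 5_000.0)
for stats in (stale, risky, core):
    print(stats.name, "->", retention_decision(stats))
```

The rule replaces a fixed calendar with measured usage, and routes the risky cases to a person rather than deciding them automatically.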

This is still an emerging area; the methodologies behind infonomics have been developed over the last 10 to 15 years – and mostly as an academic concept. Said McReynolds, “We are just now starting to see the early stages of how to apply that to Data Management in organizations.”


Photo Credit: Maxx-Studio/
