
Ten Recommendations for Building Great Data Catalogs

By Oksana Sokolovsky and Rohit Mahajan

Data catalogs can be powerful platforms for Data Management, and enterprise interest in them is continually growing.  But all the power and features data catalogs may bring can be squandered without a good data cataloging methodology, paired with common-sense practices.  With that in mind, we present below ten recommendations for data catalog success.

1. Catalog all your data
Think your relational databases and data warehouses have all your data?  Think again!  Data is everywhere: in text files, spreadsheets and more.  You may not like how scattered it is, but you can’t even begin to address that issue until you’ve inventoried everything, by looking far and wide.  Get everyone on the team on board and instill in them the discipline to think through all the places where their data may be nestled.  Then make sure to get it in the catalog.
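
As a rough illustration of that inventory step, here is a minimal Python sketch that walks a folder tree and lists files that look like data.  The function names and the `DATA_EXTENSIONS` set are our own assumptions for the example, not part of any particular catalog product; a real inventory would also cover databases, cloud storage and SaaS tools.

```python
from collections import Counter
from pathlib import Path

# Hypothetical list of extensions treated as "data files"; adjust for your environment.
DATA_EXTENSIONS = {".csv", ".tsv", ".txt", ".json", ".xls", ".xlsx", ".parquet"}


def inventory_data_files(root):
    """Return a sorted list of (path, extension, size_in_bytes) for data-like files under root."""
    entries = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in DATA_EXTENSIONS:
            entries.append((str(path), path.suffix.lower(), path.stat().st_size))
    return sorted(entries)


def summarize(entries):
    """Count inventoried files per extension."""
    return Counter(extension for _, extension, _ in entries)
```

Calling `summarize(inventory_data_files("/some/shared/drive"))` (the path is a placeholder) gives a quick count of how many spreadsheets and text files sit outside your databases.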

2. When it comes to data flows, expect the unexpected
Data lineage and provenance tools are good, as far as they go.  But most of them map out the flow of data within a known domain or set of domains.  A good data catalog, one that’s backed by data flow discovery, will often identify flows between quite disparate data sets.  This helps you discover data movement within your organization which may not be well-known.  These flows can then be checked for validity.
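
One simple way to picture data flow discovery is to compare the values held in columns from different data sets and flag pairs that overlap heavily.  The Python sketch below does this with Jaccard similarity.  It is only an illustration under our own assumptions (the function names, the input shape and the 0.5 threshold are hypothetical), and real discovery engines use far richer signals.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections of values, between 0.0 and 1.0."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_flows(columns, threshold=0.5):
    """Suggest column pairs from different data sets whose values overlap.

    `columns` maps (dataset_name, column_name) to an iterable of values.
    Returns (left, right, score) tuples, highest score first.
    """
    candidates = []
    for (left, left_values), (right, right_values) in combinations(columns.items(), 2):
        if left[0] == right[0]:
            continue  # only compare columns that live in different data sets
        score = jaccard(left_values, right_values)
        if score >= threshold:
            candidates.append((left, right, round(score, 2)))
    return sorted(candidates, key=lambda item: -item[2])
```

Each candidate pair is a hint, not proof, that data moved from one place to the other; it still needs to be checked for validity by someone who knows the data.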

3. Make sensitive data paramount
A major mission of a data catalog is to help identify the location of sensitive data, wherever it lies.  And if the same sensitive data is found in multiple places, that can help you identify redundant data, too.  Exposing sensitive and redundant data lets you manage it, minimize the surface area for breaches and establish robust data protection.
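
To make this concrete, the following Python sketch classifies columns by matching their values against a couple of patterns, then reports every place each sensitive type was found.  The patterns, the 0.8 threshold and the function names are illustrative assumptions on our part; production-grade detection needs many more patterns plus validation logic.

```python
import re
from collections import defaultdict

# Illustrative patterns only; real detection needs far more care.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify_column(values, threshold=0.8):
    """Return the first label whose pattern matches at least `threshold` of the non-empty values."""
    non_empty = [value for value in values if value]
    if not non_empty:
        return None
    for label, pattern in PATTERNS.items():
        hits = sum(1 for value in non_empty if pattern.match(value))
        if hits / len(non_empty) >= threshold:
            return label
    return None


def find_sensitive_columns(datasets):
    """Map each sensitive label to the (dataset, column) pairs where it was detected.

    `datasets` maps a data set name to a dict of column name -> list of string values.
    """
    found = defaultdict(list)
    for dataset_name, columns in datasets.items():
        for column_name, values in columns.items():
            label = classify_column(values)
            if label:
                found[label].append((dataset_name, column_name))
    return dict(found)
```

A label that shows up in more than one location is a prompt to ask whether the copies are redundant and whether each one is adequately protected.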

4. Include “unstructured” data, too
All data has structure – though, for some data, the structure is in the eye of the beholder.  Your data catalog can help make implicit data structures explicit, by prescribing the structure, in context for your team or organization.

5. Use good names; use even better descriptions
As good as a name might be, a verbose description will make your data more discoverable by more team members.  A description can indicate alternate names for the same object and help build out your data ontology more comprehensively.  Remember, one person’s part number is another person’s catalog number.  Good descriptions will make that clear.
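
A small Python sketch shows why descriptions matter for discovery.  The catalog entries and the `search` function here are hypothetical examples of ours, and the matching is a plain substring search rather than anything a real catalog would ship.

```python
# Hypothetical catalog entries: a terse name plus a fuller description.
CATALOG = [
    {
        "name": "part_no",
        "description": "Part number for a stocked item; also known as catalog number or SKU.",
    },
    {
        "name": "cust_id",
        "description": "Unique customer identifier.",
    },
]


def search(catalog, query):
    """Return names of entries whose name or description contains every query term."""
    terms = query.lower().split()
    results = []
    for entry in catalog:
        text = (entry["name"] + " " + entry["description"]).lower()
        if all(term in text for term in terms):
            results.append(entry["name"])
    return results
```

A search for "catalog number" finds `part_no` only because the description records the alternate name; the column name alone would never match.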

6. Remember, data lake “tables” are different
Unlike relational databases, where data may be spread across multiple tables, data lakes tend to crowd lots of data into individual files.  In the parlance of Business Intelligence, a single data set may store measures and dimensions together, rather than separately.  This is true even for systems that represent data as tables in a database (as opposed to files in a folder).  This can make the data less discoverable, but data catalogs address that problem head-on.

7. Be judicious in your ratings
Crowd-sourced star ratings, endorsements and deprecations in your data catalog can help users get to relevant, reliable data, faster.  But you’ll need stringent standards.  Data shouldn’t get a five-star rating unless it meets a very high bar.  Likewise, good data shouldn’t be rated poorly.  Users need confidence in the ratings, or they won’t trust them.  Make sure standards are uniform and precise.

8. Make it a lake, not a swamp
Cataloging everything in your data lake enables you to organize it and make it usable.  Once your lake is cataloged, you can establish zones within it, and make it a go-to place for business users to get data, not just a place for them to dump it.

9. Add data validation rules
Plain-English descriptions in a data catalog are important and help record and disseminate so-called tribal knowledge from business users.  But the technologists need to participate too, by entering strict data validation rules that can verify that data matches catalog definitions.  This helps assure data quality and acts as a check against more qualitative star ratings and endorsements.  Having validation rules is the data catalog equivalent of “trust but verify.”
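
Here is a minimal Python sketch of what such rules might look like when stored alongside the plain-English descriptions.  The entry layout, the rule names (`not_null`, `matches`, `min`) and the part number format are assumptions we made up for the example, not the syntax of any specific catalog.

```python
import re

# Hypothetical catalog entry: each column has a description and a list of (rule, argument) pairs.
CATALOG_ENTRY = {
    "part_number": {
        "description": "Part number, also called catalog number.",
        "rules": [("not_null", None), ("matches", r"[A-Z]{2}-\d{4}")],
    },
    "quantity": {
        "description": "Units on hand.",
        "rules": [("not_null", None), ("min", 0)],
    },
}


def check(rule, argument, value):
    """Return True if the value satisfies the rule."""
    if rule == "not_null":
        return value is not None
    if value is None:
        return True  # missing values are reported by not_null only
    if rule == "matches":
        return re.fullmatch(argument, str(value)) is not None
    if rule == "min":
        return value >= argument
    raise ValueError(f"unknown rule: {rule}")


def validate(rows, entry):
    """Return a list of (row_index, column, rule) tuples, one per violation."""
    violations = []
    for index, row in enumerate(rows):
        for column, spec in entry.items():
            for rule, argument in spec["rules"]:
                if not check(rule, argument, row.get(column)):
                    violations.append((index, column, rule))
    return violations
```

Running `validate` on a sample of rows gives an objective count of violations that can be set beside the star ratings for the same data set.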

10. Leverage Machine Learning
Today’s data volumes make it impossible to catalog everything manually.  You’ll simply never finish, or even keep pace, as new data arrives.  Machine Learning is the key to asserting control over the volume problem. 

Machine Learning models can identify data types and relationships, out of the box.  Moreover, they can observe the data types and relations you identify and incorporate that information to increase accuracy.  This helps build out your catalog across more data sets, and propagate data tags across more objects, much more quickly than with a manual catalog build-out.

If your data catalog doesn’t leverage Machine Learning in the actual data and relationship identification work, you’ll face enormous headwinds in your data-driven journey.

Ultimately, a data catalog is a guidebook to your data, organized in a fashion that makes sense to you and your team.  With a little formality in approach, you’ll be in a position to organize, govern and utilize your data to its fullest potential.  These recommendations should get you on the right path.
