
The Data Landscape


by Malcolm Chisholm

Data is increasingly being seen as a strategic asset by enterprises.  This shift comes after many years of evangelizing by the data management profession, but also from a growing realization among business users that IT is not just about technology.  However, the shift in perception places many new demands on data management.  One of them is an expectation that the enterprise’s physical data assets will be managed, even though data management has historically focused on logical components, such as data models, rather than on physical data.

Some very simple questions are now being asked of data management.  What data does the enterprise manage?  Where is it located?  What does it mean?  These questions are concerned with physical production data.  In order to answer them, it is helpful to think of the disposition of production data in the form of a map.  If the physical data can be thought of in this way, then its totality is best described as a landscape.  The description is fitting because individual IT personnel and knowledge workers, and even data management professionals, typically cannot see very far across this landscape.  They can usually see data only out to a horizon that lies very near them, as on a small planet whose surface curves away sharply.

The data landscape is the totality of the enterprise’s physical data assets.  It is not the logical view of how the enterprise sees, or should see, its data in business terms – the view that is, in theory, captured in logical data models.  However, the idea of being able to look at a map of the data landscape, see what is out there, and then zoom in on an area of interest is incredibly powerful.  The notion of virtually hovering over the terrestrial landscape, viewing it broadly from a high level, and then magnifying it to any level of detail is almost an expectation today.  It is exactly the kind of approach we need to work with the enterprise’s physical data assets.

How Do We Map the Data Landscape?

Unfortunately, while the idea of a data landscape is very powerful, we do not yet have a good way to represent it in diagrammatic form.  Doing so requires somebody to work out how to present the constructs that exist within the landscape graphically.  Such a representation, if it is ever achieved, is likely to be rich and complex.  Since we have not reached that point for the data landscape, the next best thing is to treat the components of the landscape as structured metadata items stored in a repository.

If representing the data landscape is one problem, figuring out how to map it is another, and much larger, problem.  There is rarely any documentation available for such an exercise.  What documentation does exist is nearly always produced to comply with some kind of directive.  It is very rarely kept up to date, and it is seldom trusted.  If it is used at all, it is as a starting point for analysis.  An additional problem is that documentation is frequently undiscoverable – buried in folder structures on disparate servers.  It may be very difficult to get at the appropriate documentation in a timely fashion.

Data models should, in theory, be helpful; but if they are logical data models they rarely help in understanding physically instantiated databases.  Physical databases are often quite different from logical models, and logical models are rarely updated in step with changes to physical database structures.  Furthermore, physical databases have additional metadata associated with them, such as the RDBMS platform, the server, the IP address, and so on.  These problems cannot be overcome by reverse-engineering a physical database into a data model.  The resulting entities and attributes will have no definitions or business names, and it is very rare that declared foreign keys exist to show what the logical relationships are.  All this means that there is very limited knowledge of physical data assets.
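As an illustration of how little reverse engineering recovers, here is a minimal sketch using SQLite.  The schema is hypothetical – cryptic physical names and no declared foreign keys, as is common in production databases:

```python
import sqlite3

# A hypothetical physical schema: cryptic names, no declared foreign keys.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T_CUST (CUST_ID INTEGER PRIMARY KEY, CUST_NM TEXT);
CREATE TABLE T_ORD  (ORD_ID INTEGER PRIMARY KEY, CUST_REF INTEGER, ORD_AMT REAL);
""")

# Reverse-engineering recovers physical column names and types only;
# there are no business names and no definitions to go with them.
cols = [row[1] for row in conn.execute("PRAGMA table_info(T_ORD)")]
print(cols)  # ['ORD_ID', 'CUST_REF', 'ORD_AMT']

# And no relationships: CUST_REF is logically a foreign key to T_CUST,
# but nothing in the database catalog says so.
fks = list(conn.execute("PRAGMA foreign_key_list(T_ORD)"))
print(fks)   # [] -- the join path must be guessed and tested by hand
```

Everything the analyst actually needs – what CUST_REF means, and that it points at T_CUST – is simply absent from the catalog.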

The Scale of the Landscape

The traditional way in which most data professionals would attempt to map the landscape would be by forward human analysis.  Individuals are given the task of looking at tables and columns in particular databases and figuring out what it all means.  The analysts may use SQL to query the data values, and may even have data profiling tools to help them in their work.  However, such efforts are doomed from the start if they go beyond just a handful of columns.

Imagine an organization with 50 databases, each with an average of 100 tables.  This is not a big organization, and 100 tables per database is not huge.  If each table has an average of 10 columns we now have 50 x 100 x 10 = 50,000 columns spread out across the data landscape.  No organization has enough analysts to crawl through such a landscape and map it out.  Not only are there too many columns, but the tasks required on a per column basis are too complex and large.
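The arithmetic above can be sketched, along with a rough estimate of the analyst effort implied.  The productivity figures here (10 columns documented per analyst-day, a five-person team, 220 working days per year) are purely hypothetical:

```python
# Back-of-the-envelope scale of the landscape, using the figures above.
databases, tables_per_db, cols_per_table = 50, 100, 10
total_columns = databases * tables_per_db * cols_per_table
print(total_columns)  # 50000 columns to understand

# Hypothetical productivity: 10 columns fully documented per analyst-day.
analyst_days = total_columns / 10

# A five-person team, 220 working days per analyst-year.
team_years = analyst_days / (5 * 220)
print(round(team_years, 1))  # 4.5 -- years for a single pass
```

Even under these generous assumptions, a single pass over a modest landscape takes years – by which time much of it will have changed.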

For instance, a column could be considered as a key, or candidate key, if every value in it is unique.  However, it may not be declared as a key in the database table.  I have seen plenty of examples of this.  Suppose we are also dealing with a database that has no declared foreign keys – something very common in my experience.  If there is another table with a column that is actually a foreign key of the column that functions as a key in the first table, how will we ever find it?  A human analyst will have to make guesses and then test them.  The testing will involve examining the contents of the two columns in question to see how related they are.
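The two tests described above – uniqueness for a candidate key, and value containment for a candidate foreign key – can be sketched as follows.  The function names and the sample column contents are hypothetical:

```python
def is_candidate_key(values):
    """Every value non-null and unique -> the column could serve as a key,
    even if no key is declared in the database."""
    non_null = [v for v in values if v is not None]
    return len(non_null) == len(values) and len(set(non_null)) == len(non_null)

def fk_containment(child_values, parent_values):
    """Fraction of the child column's values found in the parent column.
    A fraction near 1.0 suggests an undeclared foreign key."""
    parent = set(parent_values)
    child = [v for v in child_values if v is not None]
    if not child:
        return 0.0
    return sum(1 for v in child if v in parent) / len(child)

customer_id = [101, 102, 103, 104]       # functions as a key, undeclared
order_cust  = [101, 101, 103, 104, 104]  # actually a foreign key?

print(is_candidate_key(customer_id))            # True
print(fk_containment(order_cust, customer_id))  # 1.0 -> strong FK candidate
```

This is exactly the guess-and-test work a human analyst would have to repeat for every plausible pair of columns – which is why it does not scale past a handful of tables.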

The idea that source data analysis – the analysis of physical data values – maps exactly to the task of analysis in systems development is an illusion.  The word “analysis” may appear in both places, but source data analysis and systems analysis are quite different concepts.

The Way Forward

Not only is such an approach impractical, it is also unreliable, because human analysts may not have the time or capacity to think about all the pattern matching they need to perform.  It is also unsustainable.  Suppose a team of human analysts were able to map out a tiny portion of the data landscape.  As they moved on to the next area, there would be a good chance that the ground they had just covered would change.  The longer the time that passes, the greater the doubt about the current accuracy of what they have mapped.

The answer to this is that mapping the data landscape cannot be done by forward human analysis.  It needs tools.  These tools are now appearing in the marketplace, although the concepts behind them are poorly understood and the discipline of enterprise information management is not mature enough to utilize them fully.  Nevertheless, the direction is clear.  The enterprises that understand their data landscapes will be able to manage their data orders of magnitude more effectively than those who do not.  The data landscape is a new paradigm of extraordinary power, and it is coming.

ABOUT THE AUTHOR

Malcolm Chisholm

Malcolm Chisholm, Ph.D., has over 25 years of experience in enterprise information management and has worked in a wide range of sectors. He specializes in setting up and developing enterprise information management units, master data management and business rules. Malcolm has authored two books: Managing Reference Data in Enterprise Databases (Morgan Kaufmann, 2000) and How to Build a Business Rules Engine (Morgan Kaufmann, 2003).
