The Five Horsemen and Disparate Data

by Michael Brackett

Most public and private sector organizations have a serious plague known as data resource disparity.  Their data resource is disparate, that disparity is getting worse, and it’s adversely impacting the business.  Yet organizations continue staking their future on increasing quantities of disparate data.

Disparate data are any data that are essentially not alike, or are distinctly different in kind, quality, or character.  They are unequal and cannot be readily integrated to meet the business information demand.  They are low quality, defective, discordant, ambiguous, heterogeneous data.  Massively disparate data is the existence of large quantities of disparate data within a large organization, or across many organizations involved in similar business activities.  A disparate data resource is a data resource that is substantially composed of disparate data that are dis-integrated and not subject-oriented.  It is in a state of disarray, where the low quality does not, and cannot, adequately support an organization’s business information demand.

Disparate data are the result of three major trends in data resource management: prolific hype-cycles, a large lexical challenge, and the five horsemen*.  The first two trends, hype-cycles and the lexical challenge, are attitudes that people hold which lead to a disparate data resource.  The five horsemen are the actions people take that directly cause a disparate data resource.  The attitudes are certainly important, but it’s the five horsemen that are directly creating disparate data.

The first horseman is a brute-force-physical action that goes directly to the task of developing the physical database.  It skips all of the formal analysis and design activities, and often skips the involvement of business professionals and domain experts.  Those taking such an action consider that developing the physical database is the real task at hand and any other tasks are unnecessary.

Brute-force-physical actions include creating the database code without any formal analysis or design of the business needs.  The primary purpose of most data modeling tools, in spite of how they are advertised and marketed, is to cut the code for the physical database.  Physical data models are developed and the database is created from those physical data models.  Business professionals are seldom involved in any review.  If they are involved, the review is superficial because the data models are seldom readily understood by the business professionals.

Although many data modeling tools appear to produce both logical and physical data models, in most situations the data models are really physical.  The data model may show formal names and definitions, or may show abbreviated names and formats.  However, the structure is physical, leading to the terms logical-physical and physical-physical data models.

Brute-force-physical actions often include a conceptual data model as a high level (generalized) data model to gain high level consensus, so that physical development of the database can proceed.  However, no formal logical design techniques, no formal data names, no comprehensive data definitions, no data structure related to the organization’s perception of the business world, and no precise data integrity rules are developed.  The objective is to get a database in place quickly to keep the business happy, yet the result is often an unhappy business.  These brute-force-physical, and sometimes upper-brute-force-physical, actions simply lead to increased data disparity.

The second horseman is a paralysis-by-analysis action that is an ongoing analysis and modeling effort to make sure everything is complete and correct.  Data analysts and data modelers are well known for analyzing a situation and working the problem forever before moving ahead.  They often want to build more into the data resource than the organization really wants or needs at that time.  The worst, and most prevalent, complaint about data modeling today is its tendency to paralyze the development process by drawing out the analysis process.  Prolonging analysis to get the data model totally complete and accurate delays the project and forces the business to proceed with development anyway, often creating disparate data.

Another frequent complaint about data modeling is that the project is stalled because all the business rules have not been captured or documented.  However, some business rules relate to designing a data resource, while others relate to designing processes.  Only the business rules that relate directly to data resource design need to be captured for data modeling.

A third frequent complaint about data modeling is that all data have not yet been documented and database development cannot proceed.  In many situations the data that have not yet been documented are way beyond the scope of the current project.  Data modelers seem to want to include all data that may ever be needed, rather than including just the data currently needed.

Paralysis-by-analysis is the opposite of brute-force-physical, and is often used as an excuse to justify brute-force-physical actions.  Some database developers encourage paralysis-by-analysis simply to justify moving directly to physical database development.

The third horseman is a warping-the-business action that warps the design of the organization’s data to the fixed data design of a purchased application.  Each organization has a data design that fits their perception of the business world where they operate.  That data design often does not match the data design of a purchased application.  The result is that an organization’s way of doing business becomes warped to fit the application.

Many organizations are serially warping their data design from one purchased application to the next, without any consideration for how the business operates.  Many organizations have parallel warping of their data design where part of the design is warped for one purchased application, another part is warped another way for another purchased application, and so on.  Both of these actions ultimately lead to the data being warped in a manner that does not represent the way the organization desires to do business.

Many predefined data models, whether standards, architectures, patterns, templates, and so on, are available to assist with data resource design.  The problem is that many of these predefined data models are used to force an organization into that predefined data model without any regard for the way the organization perceives the business world where they operate.  Such forcing is simply warping-the-business.  Even if the predefined data model were a perfect fit, the organization would lose the benefit of going through the data modeling effort to thoroughly understand their business and the data needed to support that business.  A better approach is to use predefined data models to help guide the organization toward developing a data resource that directly supports their perception of the business world.

The fourth horseman is a suck-and-squirt action that designates a single record or system of reference, sucks the data out of that record or system of reference, performs superficial cleansing, and squirts the data into a target database.  The action is usually part of an ETL process where little attention is paid to the conditional sourcing of data, the data integrity, or the data meaning.  Such transient data integration usually results in the creation of additional disparate data.  Little progress is made toward formal data resource integration.
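The difference between loading from a single system of reference and conditionally sourcing each data item can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the two source systems, the record layout, and the sourcing rules (newest record wins for the name, CRM preferred for the phone) are all hypothetical and are not taken from any particular ETL tool.

```python
# Hypothetical records for the same customer held in two source systems.
BILLING = {"C1": {"name": "ACME CORP", "phone": None, "updated": "2012-03-01"}}
CRM = {"C1": {"name": "Acme Corporation", "phone": "555-0100", "updated": "2012-09-15"}}


def suck_and_squirt(customer_id):
    """Copy everything from one designated system of reference."""
    record = BILLING[customer_id]
    # Superficial cleansing only: trim whitespace from the name.
    return {"name": record["name"].strip(), "phone": record["phone"]}


def first_present(*values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value not in (None, ""):
            return value
    return None


def conditional_source(customer_id):
    """Choose each data item from the source judged best for that item."""
    billing = BILLING.get(customer_id, {})
    crm = CRM.get(customer_id, {})
    # Assumed rule 1: take the name from the most recently updated record.
    # ISO-format date strings compare correctly as plain strings.
    newest = max((billing, crm), key=lambda r: r.get("updated", ""))
    # Assumed rule 2: prefer the CRM phone, fall back to billing.
    return {
        "name": first_present(newest.get("name"), billing.get("name"), crm.get("name")),
        "phone": first_present(crm.get("phone"), billing.get("phone")),
    }
```

In this sketch the single-source load carries the missing phone number straight into the target, while the conditional version fills it from the other system.  Real sourcing rules would also have to address data integrity and data meaning, which this example does not attempt.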

The fifth horseman is a process-structured-data action that structures the data resource according to the processes using the data rather than according to formal data design techniques.  Many business professionals describe their data needs in terms of business processes.  Many data modelers and data architects tend to structure the data resource according to those processes, claiming the data model is more easily understood by the business professionals and the database is easier to build.  Data files are designed to support specific business processes rather than being designed according to formal data management concepts and principles.  Data are stored redundantly in different data files to support specific business processes, requiring bridges and feeds to keep those data in sync.  The result is redundant data and an increase in disparate data.
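A small Python sketch shows how redundant storage drifts when a feed is missing.  The two files, their fields, and the function names are hypothetical and serve only to illustrate the point.

```python
# Process-structured storage: each process keeps its own file,
# so the customer address is stored twice (hypothetical data).
billing_file = {"C1": {"address": "1 Main St", "balance": 120.0}}
shipping_file = {"C1": {"address": "1 Main St", "carrier": "ground"}}


def change_address_in_billing(customer_id, new_address):
    """Update the billing copy only; no feed pushes the change to shipping."""
    billing_file[customer_id]["address"] = new_address


def find_disparities():
    """List customers whose address differs between the two files."""
    return [
        customer_id
        for customer_id in billing_file
        if customer_id in shipping_file
        and billing_file[customer_id]["address"] != shipping_file[customer_id]["address"]
    ]
```

Calling change_address_in_billing("C1", "9 Oak Ave") leaves the shipping file with the old address, and find_disparities() then reports that customer.  Every additional process-specific file adds another copy that must be kept in step.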

Data structures are orthogonal to process structures and those two structures must be kept separate during design.  The principle of independent architectures states that each primary component of the information technology infrastructure has its own architecture independent of the other architectures.  In other words, the structure of the data must be independent of the structure of the business processes using those data.  That principle must be kept in mind when developing data models.

Information systems (applications) integrate the data and process structures to perform specific tasks.  They store and extract the necessary data according to the data structure, and process those data according to the process structure.
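One way to picture the separation is the following minimal Python sketch.  The Customer structure, the two processes, and the application functions are assumptions made for illustration, not a prescribed design.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    """Data structure: one subject-oriented definition of a customer."""
    customer_id: str
    name: str
    address: str


# A single store of customer data, independent of any one process.
customers = {"C1": Customer("C1", "Acme Corporation", "1 Main St")}


# Process structure: steps that operate on a Customer without knowing
# where or how customer data are stored.
def format_invoice_header(customer: Customer) -> str:
    return f"Invoice for {customer.name}, {customer.address}"


def format_shipping_label(customer: Customer) -> str:
    return f"Ship to: {customer.name}, {customer.address}"


# Applications: integrate the data structure and the process structure.
def run_billing(customer_id: str) -> str:
    return format_invoice_header(customers[customer_id])


def run_shipping(customer_id: str) -> str:
    return format_shipping_label(customers[customer_id])
```

Because both applications read the same customer data, a change of address made once is seen by billing and shipping alike, with no bridge or feed between process-specific copies.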

Avoiding the creation of a disparate data resource requires avoiding the five horsemen.  Data management professionals, including data architects, data modelers, and database technicians, must avoid all five horsemen if they ever hope to achieve development of a high quality data resource that fully supports an organization’s business information demand.  Data must be formally designed to support an organization’s perception of the business world according to formal logical design techniques, and then adjusted according to formal physical design techniques for implementation.


* Adapted from the author’s Keynote Presentation at the Data Modeling Zone Conference, Baltimore, Maryland, November 13, 2012.