Overcoming a World Awash in Dirty Data

Read more about author Eliud Polanco.

Like an invisible virus, “dirty data” plagues today’s business world. That is to say, inaccurate, incomplete, and inconsistent data is proliferating in today’s “big data”-centric world.

Working with dirty data costs companies millions of dollars annually. It decreases the efficiency and effectiveness of departments spanning the enterprise and curtails efforts to grow and scale. It hampers competitiveness, heightens security risks, and presents compliance problems.

Those in charge of Data Management have grappled with this challenge for years. Many of the currently available tools can address Data Management issues for siloed teams within departments, but not for the company at large or for broader data ecosystems. Worse, these tools frequently end up creating even more data that must be managed – and that data, too, can become dirty, causing more headaches and revenue loss.

Understanding Dirty Data

Dirty data refers to any data that is misleading, duplicated, incorrect or inaccurate, not yet integrated, in violation of business rules, inconsistently formatted, or marred by errors in punctuation or spelling.
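A few of these categories can be checked mechanically. The following is a minimal Python sketch, not a full Data Quality tool: the records, field names, and helper functions (normalize, find_incomplete, find_duplicates) are all hypothetical and exist only to show how incomplete and duplicate records might be flagged.

```python
# Hypothetical customer records; names and values are invented for illustration.
records = [
    {"id": 1, "name": "Ana Ruiz", "email": "ana@example.com"},
    {"id": 2, "name": "ana ruiz ", "email": "ANA@EXAMPLE.COM"},
    {"id": 3, "name": "Bo Chen", "email": ""},
]


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not hide duplicates."""
    return " ".join(value.lower().split())


def find_incomplete(rows, required):
    """Return ids of rows where any required field is missing or blank."""
    return [
        r["id"]
        for r in rows
        if any(not str(r.get(f, "")).strip() for f in required)
    ]


def find_duplicates(rows, key_fields):
    """Return groups of ids that share the same normalized key fields."""
    groups = {}
    for r in rows:
        key = tuple(normalize(str(r.get(f, ""))) for f in key_fields)
        groups.setdefault(key, []).append(r["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


incomplete = find_incomplete(records, ["name", "email"])   # [3]
duplicates = find_duplicates(records, ["name", "email"])   # [[1, 2]]
```

Records 1 and 2 differ only in capitalization and spacing, so they surface as duplicates once normalized. Misleading or business-rule-violating data generally needs domain knowledge that simple checks like these cannot supply.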

To grasp how dirty data has become ubiquitous in recent decades, imagine the following scenario: 

Lenders at a large bank become perplexed when they discover that almost all of the bank’s customers are astronauts. Considering that NASA has only a few dozen astronauts, this makes no sense. 

Upon further exploration, the lending department discovers that bank officers opening new accounts had been inserting “astronaut” into the customer occupation field. The lenders learn that the job description is irrelevant to their counterparts responsible for new accounts. The bank officers had been selecting “astronaut,” the first available option, simply to move more swiftly in creating new accounts.
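A pattern like this one can often be caught with basic column profiling. The sketch below is an assumption-laden illustration rather than anything the bank in the scenario used: the function name dominant_value, the 80% threshold, and the sample column are all hypothetical.

```python
from collections import Counter


def dominant_value(values, threshold=0.8):
    """Return (value, share) if a single value exceeds `threshold` of all entries, else None."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    share = count / len(values)
    return (value, share) if share > threshold else None


# Hypothetical occupation column: nine of ten customers are "astronauts".
occupations = ["astronaut"] * 9 + ["teacher"]
flag = dominant_value(occupations)  # ("astronaut", 0.9)
```

A flagged value is only a prompt for investigation. Some fields legitimately have one dominant value, so the threshold would need tuning per field.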

The lenders, however, must have their customers’ correct occupations on record to obtain their annual bonuses. To remedy the situation, the lending department develops its own, separate database. They contact each customer, learn the correct occupation, and insert it into their database.

Now, the bank has two databases with essentially the same information, apart from one field. If a third department wants to access the information in those databases, no system exists to determine which database is accurate. So, that third department might also create its own database.
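The disagreement between the two databases is easy to detect, even though deciding which one is right is not. Here is a minimal sketch, assuming both stores can be read as dictionaries keyed by a shared customer id; the store names, ids, and the conflicting_fields helper are hypothetical.

```python
def conflicting_fields(db_a, db_b):
    """Map each shared customer id to the sorted list of fields whose values differ."""
    conflicts = {}
    for cid in db_a.keys() & db_b.keys():
        a, b = db_a[cid], db_b[cid]
        diff = sorted(f for f in a.keys() | b.keys() if a.get(f) != b.get(f))
        if diff:
            conflicts[cid] = diff
    return conflicts


# Hypothetical extracts from the new-accounts and lending databases.
accounts_db = {
    101: {"name": "Ana Ruiz", "occupation": "astronaut"},
    102: {"name": "Bo Chen", "occupation": "astronaut"},
}
lending_db = {
    101: {"name": "Ana Ruiz", "occupation": "teacher"},
    102: {"name": "Bo Chen", "occupation": "astronaut"},
}

conflicts = conflicting_fields(accounts_db, lending_db)  # {101: ["occupation"]}
```

The comparison reports that the two stores disagree about customer 101, but nothing in the data says which value is correct. That missing piece is exactly what pushes a third department toward building its own copy.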

Similar scenarios have played out in organizations nationwide for decades.

Burgeoning Digital-Data Landfills

The trouble began in the 1990s with the digital transformation boom. Companies deployed enterprise software to improve their business processes. Software-as-a-service products from Salesforce, for instance, enabled better ways to manage sales and marketing systems.

But 30 years later, such legacy infrastructure has resulted in a Data Management nightmare. Disparate data silos with reams of duplicate, incomplete, and incorrect information pepper the corporate and public-sector landscapes. Those silos follow lines of business, geographies, and functions, each of which owns and oversees its own data sources.

Beyond that, data generation has increased exponentially over the decades. Each business process now necessitates its own software, producing ever more data. Applications log every action in their native databases, and obstacles to mining the newly created data assets have surfaced.

In previous decades, vocabulary defining data was specific to the business process that created it. Engineers had to translate those lexicons into discrete dictionaries for the systems consuming the data. Quality guarantees typically didn’t exist. As in the astronaut example above, data that was usable by one business function was unusable by others. And access to data from the original business processes was limited, at best, for functions that might otherwise have used it to optimize their own work.

The Copy Conundrum

To solve this problem, engineers began to make copies of original databases because, until recently, it was the best option available. They then transformed those copies to satisfy the requirements of the consuming function, applying Data Quality rules and remediation logic exclusive to the consuming function. They made many copies and loaded them into multiple data warehouses and analytics systems.

The outcome? An overflow of dataset copies that read as “dirty” to some parts of the organization, causing confusion about which copy is the right one. Companies today have hundreds of copies of source data across operational data stores, databases, data warehouses, data lakes, analytics sandboxes, and spreadsheets within data centers and multiple clouds. Yet, chief information officers and chief data officers have neither control over the number of copies generated nor knowledge of which version represents a genuine source of truth.
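One way to start untangling such copies is to fingerprint their content and see which ones still match. The sketch below is illustrative only and assumes each copy is small enough to load as a list of dictionaries; the copy names and the fingerprint and group_copies helpers are hypothetical.

```python
import hashlib
import json


def fingerprint(rows):
    """Hash a dataset's content so that row order and key order do not matter."""
    canonical = sorted(json.dumps(r, sort_keys=True) for r in rows)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def group_copies(copies):
    """Group copy names whose content fingerprints are identical."""
    groups = {}
    for name, rows in copies.items():
        groups.setdefault(fingerprint(rows), []).append(name)
    return sorted(sorted(names) for names in groups.values())


# Hypothetical source table and three downstream copies.
source = [
    {"id": 1, "occupation": "astronaut"},
    {"id": 2, "occupation": "astronaut"},
]
copies = {
    "warehouse": list(reversed(source)),   # same rows, different order
    "data_lake": source,
    "lending_sandbox": [                   # transformed by the consuming team
        {"id": 1, "occupation": "teacher"},
        {"id": 2, "occupation": "astronaut"},
    ],
}

groups = group_copies(copies)
# [["data_lake", "warehouse"], ["lending_sandbox"]]
```

Matching fingerprints show which copies are still identical, and a mismatch shows that a copy has diverged. Neither result identifies the genuine source of truth; that remains a governance decision.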

A host of Data Governance software products are available to bring some order to this mess. Those include data catalogs, Data Quality measurement and issue resolution systems, reference data management systems, master data management systems, and data lineage discovery and management systems.

But those remedies are expensive and time-intensive. A typical master data management project to integrate customer data from multiple data sources from different product lines can take years and cost millions of dollars. At the same time, the volume of dirty data is increasing at speeds that outpace organizational efforts to install controls and governance.

These approaches are rife with flaws. They rely on manual processes, development logic, or business rules to execute the tasks of inventorying, measuring, and remediating the data. 

Recovering Control

Three emerging technologies are best suited to tackle the current predicament: AI- and machine-learning-driven Data Governance, semantic interoperability platforms such as knowledge graphs, and data distribution systems such as distributed ledgers: 

1. AI- and machine-learning-driven Data Governance solutions reduce dependency on people and code. AI and machine learning replace manual work with automated actions such as tagging, organizing, and supervising massive swaths of data. Applying them to Data Management transformation and migration decreases IT costs. Organizations may also build more robust and sustainable architectures that encourage Data Quality at scale.

2. Knowledge graphs allow native interoperability of disparate data assets so that information can be combined and understood under a common format. By leveraging semantic ontologies, organizations can future-proof data with context and a common format for reuse by multiple stakeholders.

3. Distributed ledgers, differential privacy, and virtualization eliminate the need to physically copy data. Distributed ledgers comprise federated and governed databases usable across business units and organizations. Differential privacy makes it possible to protect individual records, typically by adding calibrated statistical noise, so that organizations can adhere to compliance requirements while still sharing data with stakeholders. Virtualization permits the spinning up of data in a virtual rather than physical environment.
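To make the differential privacy item in the list above more concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query, one common way the technique is implemented. It is a teaching example under stated assumptions, not a production-grade privacy library: the function names, the epsilon value, and the sample data are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon, rng):
    """Return a noisy count; a counting query has sensitivity 1, so scale = 1 / epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical occupation column shared with another department.
rng = random.Random(0)
occupations = ["teacher"] * 40 + ["engineer"] * 60
noisy = private_count(occupations, lambda o: o == "teacher", epsilon=0.5, rng=rng)
```

The consumer receives a count close to 40 rather than the exact figure or the underlying rows. A smaller epsilon adds more noise and gives stronger privacy at the cost of accuracy.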

Once CIOs and CDOs understand the problem’s root is legacy infrastructure that creates data silos, they may improve underlying architectures and data infrastructure strategies.

Dirty data limits an organization’s ability to make informed decisions and operate with precision and agility. Organizations must take control of their data and encourage data interoperability, quality, and accessibility. Doing so will furnish competitive advantages and reduce security and compliance vulnerabilities.