The Impact of Data Quality in the Machine Learning Era


A significant issue in Enterprise Data Management today is Data Quality, because business data requires thorough cleansing and preparation before it can be used as input to any Analytics or Business Intelligence system. In an era of automated and Self-Service Business Analytics, Data Quality has assumed even more importance: average business users often lack the prior knowledge or skills to differentiate between bad and good data, yet they are suddenly equipped with Advanced Analytics tools for extracting competitive and actionable intelligence from piles of complex data.

Heterogeneous data sources, high volumes of data, and a myriad of unstructured data types are now adding to the existing Data Management problems, especially those relating to Data Governance. Modern IT systems are not yet capable of handling Data Quality, which directly impacts data extraction from multiple sources, data preparation, and data cleansing. The persistent Data Quality issues indicate one critical Data Management area that has been hitherto ignored: Enterprise Data Strategy.

In today’s competitive world, every organization needs a well-designed and sustainable Data Strategy to combat the obvious complexities of multi-source, multi-type, and very high volumes of data pouring in from the latest technology funnels.

How Does Data Quality Affect Enterprises?

The recent surge in data collection technologies, coupled with advanced sensor-based hardware and low-cost data storage facilities stationed in many businesses, has put “more data at fingertips (of businesses) than ever before.” However, the sad truth is that most businesses are at a loss about what to do with the petabytes of data pouring in through daily business processes.

In Machine Learning Plays a Critical Role in Improving Data Quality, Matthew Rawlings, Head of Data License at Bloomberg, said, “It takes a lot of manual effort to clean and run that data and add some business intelligence on top of it.” The author of this article points out that understanding “context” is the key factor behind solving most data-metadata mismatches.

The article Data Quality in the Era of A.I. states that the consequences of “poor quality data” are wasted IT investments, loss of trust in enterprise data, and ineffective business decisions. Although the global IT community has partially mitigated the shortage of qualified Data Scientists by designing AI or Machine Learning (ML) powered, semi- or fully-automated Analytics Systems, the fundamental problem of Data Quality remains. End users cannot and will not trust insights that are acquired by processing corrupt, duplicate, inconsistent, missing, broken, or incomplete data.

How Are Data Management Solution Providers Responding?

The DATAVERSITY® article Data Quality in the Enterprise asserts that no matter how advanced the data technologies and how vast the data assets of an enterprise are, they cannot be useful without “reliable Data Quality.” So, more attention should be paid to data collection, storage, and preparation practices before data analysis takes place. The newer technologies like IoT are also affecting the input Data Quality, thus only good Data Governance can provide a solid structure to Enterprise Data Management practices.

The buzz around self-service, Big Data, and Machine Learning that surfaced with Data Quality Tools in Gartner’s 2017 Magic Quadrant will continue this year too. In 2018, Data Quality will increasingly occupy the topmost spot in most Enterprise Data Management priority lists, which in turn will serve as a constant reminder to solution vendors to focus on Data Quality issues on their ready-made platforms.

Most of these problems have to be rectified at the software level. Advanced Data Management systems of the future will hopefully have sufficient data-validation “logic” to check and filter out invalid data from the systems. The items which occupy the Enterprise Data Management wish list today can become system features or functions only if the complete Data Management blueprint is available via Data Strategy activities.
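As an illustration of what such data-validation “logic” might look like at the software level, here is a minimal Python sketch. The record fields (`sku`, `price`, `quantity`), the rule names, and the `filter_valid` helper are all hypothetical and chosen only for this example; they do not come from any particular Data Management product.

```python
from typing import Callable

Record = dict
Rule = Callable[[Record], bool]

# Hypothetical rules for an illustrative product record (assumed fields).
RULES: dict[str, Rule] = {
    "sku_present": lambda r: bool(str(r.get("sku") or "").strip()),
    "price_positive": lambda r: isinstance(r.get("price"), (int, float)) and r["price"] > 0,
    "quantity_non_negative": lambda r: isinstance(r.get("quantity"), int) and r["quantity"] >= 0,
}


def failed_rules(record: Record) -> list[str]:
    """Return the names of every rule the record violates."""
    return [name for name, rule in RULES.items() if not rule(record)]


def filter_valid(records: list[Record]) -> tuple[list[Record], list[tuple[Record, list[str]]]]:
    """Split records into valid ones and rejected ones with their failed rules."""
    valid, rejected = [], []
    for record in records:
        failures = failed_rules(record)
        if failures:
            rejected.append((record, failures))
        else:
            valid.append(record)
    return valid, rejected
```

Keeping the rejected records together with the names of the rules they failed, rather than silently dropping them, gives data owners something concrete to review.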

Can Machine Learning Improve Data Quality?

An Executive’s Guide to Machine Learning describes an ML algorithm as a self-teaching entity that learns from available data. Thus, data must be accurate and complete to be a reliable teaching source. The “unmanageable” size and quantity of Big Data is a major challenge to most industry operators, but this challenge can be safely confronted and tackled if sound Data Management practices are in place. This guide reconfirms that good Data Quality is needed for any Machine Learning algorithm to deliver accurate results.

A Forbes writer asserts that “The value is in the data.” For technologies like Big Data to succeed in the future business ecosystem, AI and ML tools have to deliver results. Now the industry leaders and business operators will wait for Data Quality assessment and validation methods available within Machine Learning systems to improve over time to make such systems most useful.

Data Quality must be held accountable for the current state of Enterprise Analytics. Disruptive technologies have the power to capture, collate, and synthesize disparate data formats (physical, transactional, geospatial, sensor-driven, or social), but these rich collections of data will fail to deliver useful insights unless they are appropriately prepared and cleansed as input to advanced AI or ML tools.

Data Quality directly impacts the outcome of Machine Learning algorithms, and data testing has proved that good data can actually refine the ML algorithms during the development phase. There is a close connection between Data Quality and ML tools and the long-range monetization prospects of “high-quality data” used in the industry.

Analytics and Machine Learning to Improve Data Quality provides a case study of a global retailer that achieved cost and production efficiency by improving the quality of their product and inventory data with the use of ML algorithms. An outsourced Analytics solution powered by ML algorithms was used to improve the client inventory and product data. The selling point of this solution was innovative Data Quality-centric rules to detect and rectify bad data. At the end of this experiment, the solution provider found that the new system detected and rectified about 30 percent of the records in test runs.
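The case study does not publish the retailer's actual rules, so the following Python sketch only illustrates the general idea of detecting and rectifying bad product records and reporting the share of records affected. It uses plain hand-written rules rather than ML, and the field names (`sku`, `name`) and the `rectify` and `clean_inventory` helpers are assumptions made for this example.

```python
def rectify(record: dict) -> tuple[dict, bool]:
    """Normalize a product record; report whether anything changed."""
    fixed = dict(record)
    # Collapse repeated whitespace and standardize capitalization of the name.
    fixed["name"] = " ".join(str(record.get("name", "")).split()).title()
    # Standardize the SKU so that 'ab1' and 'AB1' compare equal.
    fixed["sku"] = str(record.get("sku", "")).strip().upper()
    return fixed, fixed != record


def clean_inventory(records: list[dict]) -> tuple[list[dict], float]:
    """Rectify records, drop duplicate SKUs, and return the share of records touched."""
    seen: set[str] = set()
    cleaned: list[dict] = []
    touched = 0
    for record in records:
        fixed, was_changed = rectify(record)
        if fixed["sku"] in seen:
            touched += 1  # a dropped duplicate counts as a detected bad record
            continue
        seen.add(fixed["sku"])
        cleaned.append(fixed)
        touched += int(was_changed)
    share = touched / len(records) if records else 0.0
    return cleaned, share
```

The returned share is the kind of figure a provider could report after a test run, although how the case study's own percentage was measured is not stated in the article.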

The BIS Workshop Paper describes how Machine Learning technology was used to automate the process of detecting errors in statistical measurements. Data acquisition and preparation, data analysis, and delivery of insights must all coexist in a unified, data-validated ML system.
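The article does not describe the method used in that paper. As a much simpler stand-in for automated error detection in a series of measurements, the sketch below flags outliers with a robust z-score based on the median absolute deviation (MAD). The function name `flag_outliers` and the threshold of 3.5 are assumptions for this example, not details from the paper.

```python
from statistics import median


def flag_outliers(values: list[float], threshold: float = 3.5) -> list[int]:
    """Return indices of values whose MAD-based robust z-score exceeds the threshold."""
    if not values:
        return []
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # No spread to measure against, so nothing is flagged in this sketch.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - center) / mad > threshold
    ]
```

For a series such as `[10.1, 9.9, 10.0, 10.2, 55.0, 9.8]`, only the reading of 55.0 is flagged for review; the other values sit well within the threshold.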

The Price of Bad Data: A Failed Data Management System

Data Quality Blooms with Crowdflower describes a Data Management system that, in spite of implementing Data Strategy, Data Stewardship, and high-quality tools, failed to deliver the expected results. The reason behind this failure was Data Quality, which, as the author of the article suggests, must be jointly tackled by an organization’s data owners and data users.

Harald Smith, Director of Product Management at Syncsort, remarks that business users are making the connection between good data and good business decisions. For future businesses to thrive on competitive differentiation, the core Data Governance Strategy of the business will play a crucial role.

It may be time for Artificial Intelligence solution vendors to redesign Machine Learning systems and tools around Data Quality issues. As Peter Isaacson, the CMO at Demandbase, stated, “Artificial intelligence will destroy the world but not before it really helps B2B marketers.”


Photo Credit: pixldsign/
