Data Quality and data integrity are both important aspects of data analytics. With the rapid development of data analytics, data can be considered one of the most important assets a business owns. As a result, many organizations collect massive amounts of data for research and marketing purposes.
However, the value of this data depends on its usability and accuracy. Because data comes from a variety of sources, often with different formatting, and can be stored multiple times – with some copies containing errors – working with large quantities of data can become difficult.
To flourish, a modern data-driven business needs to include an emphasis on both data integrity and Data Quality.
The words “integrity” and “quality” both suggest a positive influence and both words are a little difficult to define. As a consequence, many people use the terms “data integrity” and “Data Quality” interchangeably, with the understanding that both terms represent improved data. (A surprisingly large number of articles have titles suggesting the topic is data integrity, but then shift to describing Data Quality.)
It’s the differences between the two definitions that are important. Knowing the differences between data integrity and Data Quality can help you communicate your specific needs and concerns to others.
Data should have integrity and be of high quality.
What Is Data Integrity?
The word “integrity” evolved from the Latin word integer, which once meant whole, complete, or undivided. (Currently, the word “integer” means a whole number.) In the 1540s, when applied to people, it came to mean a person of total honesty and sincerity (an undivided person). The modern term “data integrity” has come to mean data that is both whole and consistent (an undivided data asset).
In the late 1980s, a number of generic-drug companies were caught fabricating data and bribing Food and Drug Administration officials to gain approval for their less-expensive generic drugs. This scandal caused the FDA to shift their pre-approval inspections to focus on evaluating raw laboratory data, rather than the manufacturer’s conclusions. This raw data could not be altered or edited and needed to be honest and accurate.
Problems with misinformation from the pharmaceuticals industry continued, and in 2005, the FDA cited Able Laboratories for submitting false data and a failure to review data, including data audit trails. In 2006 and 2008, the FDA also issued warning letters to Ranbaxy about “data integrity” deficiencies. The FDA described a lack of data integrity when pointing out missing, or deliberately altered, data.
In 2008, a book titled “Operating Systems: Three Easy Pieces” was published containing a chapter titled “Data Integrity and Protection.” In this chapter, Andrea C. Arpaci-Dusseau and Remzi Arpaci-Dusseau, two computer science professors, wrote about “disk failure” modes and “detecting corruption.” Their primary focus was on dealing with data storage system failures, or “corrupted data,” with an emphasis on maintaining the data’s consistency and accuracy.
Data integrity, prior to its being confused with Data Quality, was about keeping the data whole (intact and fully functional) until it is no longer needed. It supports processes and practices that determine how data is entered, transferred, and stored without being altered or corrupted. Avoiding “corrupted data” – data that has components that have been lost, distorted, or deliberately altered – is the primary goal of data integrity.
At present, data integrity can be defined as maintaining and assuring the accuracy and consistency of data throughout its life cycle, with a priority on honest, uncorrupted data.
Data corruption takes place when the data is deliberately or accidentally altered. Accidental changes can make the data unreadable, inaccessible, or unusable for researchers, or even other data applications. In many cases, the corrupted data can no longer be read by computer software, mobile apps, or web apps. Data corruption can also slow a computer system down, or simply freeze it.
Deliberate data corruption can be an effort to provide misinformation, with the goal of deception, or can be the result of a hacker or virus.
How Data Becomes Corrupted
There are a number of factors that can impact the integrity of data, including deliberate and/or malicious behavior. The most common sources of data corruption are listed below:
- Human error: Data can be corrupted by human error in a variety of ways. Sometimes, users may accidentally delete data, overwrite or replace a file, or mishandle the data collection or migration process.
- Compromised hardware: Defective or damaged hardware can corrupt data. Hardware issues can damage data as it is collected, processed, or stored, leaving it unusable. Using appropriate, undamaged hardware reduces this risk.
- Incompatible systems: Data coming from another computer system may have incompatible formatting, which the receiving system cannot read. For example, the data sent from a NoSQL database may be incompatible with a MySQL database.
- Viruses and bugs: Viruses and other malicious software can alter, delete, and manipulate data. Software bugs, although not malicious, can damage data in the same ways.
- The transfer of errors: Data errors can be transferred, or take place during the transfer. Occasionally, data packets are completely lost during the transfer process, creating an empty record on the receiver’s side. Additionally, transfer errors can take place if the receiver is unprepared to accept all the needed data attributes.
Many of these issues can be avoided by following some basic rules, such as using error detection software, enforcing proper access controls, creating backups, and applying validation techniques.
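As a rough illustration of error detection, the sketch below records a checksum when data is written and compares it again when the data is read or received. It uses SHA-256 from Python's standard `hashlib` module. The record contents and the helper names `sha256_of` and `is_intact` are hypothetical, invented for this example, and a production system would typically rely on the checks built into its storage or transfer layer.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected_digest: str) -> bool:
    """True when the data still matches the digest recorded earlier."""
    return sha256_of(data) == expected_digest


# Hypothetical record, invented for illustration.
original = b"order_id=1001,amount=49.90"
recorded_digest = sha256_of(original)  # stored alongside the data at write time

# Simulate corruption in transit or storage: one character has changed.
received = b"order_id=1001,amount=99.90"

print(is_intact(original, recorded_digest))  # unchanged data passes the check
print(is_intact(received, recorded_digest))  # altered data fails the check
```

A checksum only detects that something changed. Restoring the original still depends on a backup or a retransmission.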
What Is Data Quality?
“Data Quality” describes the reliability of the data, its accuracy, and consistency. High-quality data is accurate and useful for good decision-making. Low-quality data describes data that contains faulty information and supports decisions that may damage the business. Data Quality is based on the data’s uniqueness, accuracy, timeliness, and consistency.
Plato used the word “quality” to mean a characteristic, which continues to be one of its meanings. During the Dark Ages, trade and manufacturing guilds applied a crude measurement system to the concept of quality (“poor quality, average quality, high quality”). High-quality data means data that is accurate for purposes of research and business intelligence.
Data of high quality should be:
- Unique: Duplicated data, or redundant data, not only has the potential to negatively affect statistical research, but can also produce interesting glitches, such as sending a customer the same product twice, with only one charge, or charging the same customer twice for a single purchase.
- Accurate: The collected data should not contain errors or misinformation. Data providing inaccurate information – because of human error, expired data, or ambiguous data – can result in costly mistakes. For example, using poorly or incorrectly titled data from the European region to predict Asian sales will provide inaccurate results, possibly creating a disaster for the business.
- Up to date: Data should be current and up to date. Old information can be even more dangerous than missing information (because of the assumption it’s still true).
- Consistent: There should be established, repetitive patterns for labeling, storing, and presenting data. All data records should be represented with consistent patterns to support efficiency and harmony within the workplace culture. Consider the confusion that could take place if different offices used two different date formats, such as America’s month/day/year and Europe’s day/month/year. (Would 12/10/23 fall in December or October?)
Most Data Quality issues are the result of human error and dysfunctional data collection policies.
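Two of the properties above, uniqueness and consistency, lend themselves to simple automated checks. The sketch below is one possible approach, not a standard method: it flags duplicate records after normalizing a key field, and flags date strings that can be read as either month/day/year or day/month/year. The sample records and the function names `find_duplicates` and `is_ambiguous_date` are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical customer records, invented for illustration.
records = [
    {"email": "ana@example.com", "signup": "12/10/23"},
    {"email": "ben@example.com", "signup": "03/25/23"},
    {"email": "ANA@example.com ", "signup": "12/10/23"},
]


def find_duplicates(rows, key):
    """Return normalized key values that appear more than once."""
    seen, dupes = set(), set()
    for row in rows:
        value = row[key].strip().lower()  # ignore case and stray whitespace
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes


def is_ambiguous_date(text):
    """True when the text parses as two different dates under the two formats."""
    parsed = []
    for fmt in ("%m/%d/%y", "%d/%m/%y"):
        try:
            parsed.append(datetime.strptime(text, fmt).date())
        except ValueError:
            pass  # not a valid date under this format
    return len(set(parsed)) > 1


print(find_duplicates(records, "email"))
print([r["signup"] for r in records if is_ambiguous_date(r["signup"])])
```

Here 12/10/23 is flagged because both readings are valid dates, while 03/25/23 is not, since there is no 25th month.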
Improving Data Integrity
Some steps can be taken to improve data integrity. Typically, a data corruption problem will present itself as soon as someone tries to work with the corrupted data. The goal is to avoid having to deal with data corruption in the first place. Ways of improving data integrity are listed below:
- Compatibility: An organization may have data stored in relational databases, legacy systems, data warehouses, and in cloud-based apps, etc. Each of these storage systems comes with its own “language” and storage methods. Data integrity requires these systems be “aligned” and compatible with one another. In most cases, corrupted data becomes unreadable by computer software, web apps, or mobile apps.
- Automation: The use of automation minimizes human error, which in turn promotes data integrity.
- Security: Viruses and bugs, as well as hackers with malicious intent, can deliberately damage and distort data. Proper security can protect the data from viruses, bugs, and hacker attacks designed to make the data unusable.
- Backing up the data: Redundant storage systems can store data safely before it becomes corrupted, providing an emergency backup version of the data.
- Useful software: There are a variety of software solutions that are designed to enhance data integrity.
Improving Data Quality
As with data integrity, there are ways to improve Data Quality. Ways of improving Data Quality are listed below.
- Correct data errors immediately: Identifying and correcting errors in the data quickly, before they can have any impact, can improve efficiency. The ETL (extract, transform, and load) process can be used to integrate data from multiple sources and store it as uniform, consistent data for later use.
- Eliminating data silos: Many large organizations have unintentionally developed data silos (isolated data storage) within different departments or other physical locations. This data is unavailable to the rest of the organization and can restrict research. Additionally, departments maintaining data silos are often prone to their own Data Quality issues. Centralizing the business’s data makes it more accessible and usable, and ensures all data is uniform and available for research.
- Collecting the right data: A business may collect significant amounts of data, but is it actually useful data? Is it collecting the correct information? Developing a collection process that focuses on the right questions and keywords, and avoids potentially useless or damaging websites, will improve efficiency.
- Promoting a data-driven culture: A Data Governance program can be used to promote the development of a data-driven culture. Data Governance is a combination of software and cultural changes that promote the efficient use of data. It requires the participation of all staff and managers and uses a framework for the collection and use of high-quality data.
- Automation: The use of automation minimizes human error, in turn promoting Data Quality.
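The ETL step and the automation point above can be pictured with a small sketch. It assumes two hypothetical source systems, `us_office` and `eu_office`, each with its own known date format. The transform step converts every row to a single ISO 8601 date format and sets aside rows that fail validation instead of loading them. All names and sample rows are invented for illustration, and a real pipeline would normally use a dedicated ETL tool.

```python
from datetime import datetime

# Hypothetical source systems, each with its own known date format.
SOURCE_DATE_FORMATS = {
    "us_office": "%m/%d/%Y",
    "eu_office": "%d/%m/%Y",
}


def transform(row, source):
    """Return a cleaned copy of the row with an ISO 8601 date, or raise ValueError."""
    amount = float(row["amount"])
    if amount < 0:
        raise ValueError(f"negative amount: {amount}")
    sold_on = datetime.strptime(row["sold_on"], SOURCE_DATE_FORMATS[source]).date()
    return {"source": source, "sold_on": sold_on.isoformat(), "amount": amount}


def run_etl(batches):
    """Transform every row; collect clean rows and rejected rows separately."""
    loaded, rejected = [], []
    for source, rows in batches.items():
        for row in rows:
            try:
                loaded.append(transform(row, source))
            except (ValueError, KeyError) as err:
                rejected.append((source, row, str(err)))
    return loaded, rejected


batches = {
    "us_office": [{"sold_on": "12/10/2023", "amount": "19.99"}],
    "eu_office": [
        {"sold_on": "12/10/2023", "amount": "25.00"},
        {"sold_on": "31/02/2023", "amount": "5.00"},  # impossible date, rejected
    ],
}

loaded, rejected = run_etl(batches)
for row in loaded:
    print(row)
print(len(rejected), "row(s) rejected for review")
```

Because each source declares its format up front, the same string 12/10/2023 is stored as 2023-12-10 for the US office and 2023-10-12 for the European office, and the ambiguity never reaches the central store.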