Measuring Data Consistency

Measuring data consistency can tell a researcher how valuable and useful their data is. However, the term “data consistency” can be confusing, because it is used in three different ways. When the term is applied to databases, it describes data consistency within the database. When used with computing strategies, data consistency is focused on the use of data caches. The third version of data consistency is used with data analytics.

Generally speaking, data consistency deals with format transformations, duplicated data, and missing information.

Data “inconsistency” causes problems, including a loss of information and incorrect results. Data consistency, on the other hand, promotes accuracy and the usability of available data, and may be the difference between a business’s success and its failure. Data has become the foundation for making successful business decisions, and inconsistent data can lead to misinformed ones.

The tools mentioned in this article are used with SQL systems.

Data Consistency in Databases

A database is a systematic, organized collection of data. It supports electronically stored data in a computer system, and allows the data to be altered. A database makes it easy to manage data. Database consistency is based on a series of rules that support uniformity and accuracy, and uses “transactions.”

A database transaction is a process that is executed independently for purposes of data retrieval or updates.

A database transaction, by definition, should be ACID-compliant (“ACID” stands for atomic, consistent, isolated, durable). The “consistent” feature helps to ensure data consistency in each transaction. Together, the features of ACID guarantee the data’s validity despite power failures, errors, and other issues.

Ideally, a database transaction should follow the all-or-none law: a write should either complete in full or not be applied at all. All of the validation rules must be in place to ensure consistency. If the rules supporting uniformity and accuracy are not followed, the entire transaction will be canceled.

Database consistency rules require that data be written and formatted in ways that support the system’s definition of valid data. If a transaction attempts to introduce inconsistent data, the entire transaction is rolled back and an error is returned to the user.
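The rollback behavior described above can be sketched with Python’s built-in sqlite3 module. This is only a minimal illustration, not the method of any particular product: the accounts table, the “balance may not go negative” rule, and the helper names make_demo_db, transfer, and balances are all assumptions made for the example.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database whose rule is: balances may not go negative."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " id INTEGER PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; return True only if the whole transfer committed."""
    try:
        # As a context manager, the connection commits if the block succeeds
        # and rolls back every statement in it if an exception is raised.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            # If this debit breaks the CHECK rule, the credit above is undone too.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {account id: balance} for every account."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


conn = make_demo_db()
print(transfer(conn, 1, 2, 500), balances(conn))  # rejected, nothing changes
print(transfer(conn, 1, 2, 30), balances(conn))   # accepted, both rows change
```

The first transfer credits one account and then fails on the debit, so the credit is discarded along with it. That is the all-or-none behavior in miniature.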

A consistent modern database contains data that is valid per clearly defined rules, which include cascades, triggers, and constraints. Database transactions must only change the affected data.

Database storage that offers consistency across an entire dataset by default produces fewer glitches and problems in general.

A lack of data consistency significantly increases the chances that data within the system is not uniform, which would result in missing or partial data. There are normally three kinds of data consistency:

Point-in-time consistency focuses on ensuring all data within the system is uniform at a specific moment in time. This process prevents a loss of data if the system crashes or there are other problems in the network. It operates by referencing bits of data in the system by way of timestamps and other consistency markers. This allows the system to restore itself to a specific point in time.

Transaction consistency is used to detect incomplete transactions and roll back the data if an incomplete transaction is found.

Application consistency works with the transaction consistency that exists between programs. If a banking program is communicating with a tax program, application consistency promotes uniform formats between the two.

Ensuring that a computer database has all three elements of data consistency covered is the best way to ensure data is not lost or corrupted as it travels throughout the system.

Measuring Data Consistency in Databases

Testing the consistency of data in a database is relatively easy. A “database consistency checker” (DBCC) can be used to measure the data’s consistency. These checkers help to ensure both the logical and physical consistency of a database. It should be noted that many DBCCs do not make automated corrections, and the problems must be corrected manually, although some more-evolved database consistency checkers do make some corrections. It is recommended that periodic checks be made to ensure the logical and physical consistency of your data.
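As a rough illustration of what a periodic check can look like, the sketch below uses two checks that SQLite ships with, PRAGMA integrity_check and PRAGMA foreign_key_check, called through Python’s sqlite3 module. SQLite is used here only as a convenient stand-in for the checkers described above; the check_consistency function and the customers/orders tables are assumptions made for the example.

```python
import sqlite3


def check_consistency(conn):
    """Run SQLite's built-in checks and summarize what they report."""
    # integrity_check yields the single row "ok" when it finds no structural
    # problems; otherwise it yields one row per problem found.
    integrity = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    # foreign_key_check yields one row for every row that points at a
    # parent record that does not exist.
    orphans = conn.execute("PRAGMA foreign_key_check").fetchall()
    return {"integrity": integrity, "fk_violations": len(orphans)}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER REFERENCES customers(id))"
)
conn.execute("INSERT INTO customers VALUES (1)")
conn.execute("INSERT INTO orders VALUES (10, 1)")
# SQLite does not enforce foreign keys unless asked to, so this orphaned
# order is accepted and only shows up when the check is run.
conn.execute("INSERT INTO orders VALUES (11, 99)")
conn.commit()

print(check_consistency(conn))
```

Like the checkers described above, these checks only report problems. Deciding what to do with the orphaned order is still a manual step.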

According to Microsoft, when using their cloud, the best way to repair database errors is by comparing the current database with the last good backup.

The Consistency of Caches

“Caching” is storing data that is accessed frequently in a convenient, nearby location (called a cache). Distributed caching is an extension of the caching technique, with the cache being distributed across, and accessible by, multiple servers or machines.

Distributed caching is an extremely useful tactic designed to improve the performance and speed of applications. Distributed caches are often used to power high-traffic websites and web applications, allowing data to be retrieved more quickly and efficiently.

Distributed caches typically use distributed hashing, which makes use of an algorithm called consistent hashing. A hash function maps one piece of data, normally an identifier for an object, to another piece of data, called a hash code or a hash.
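A minimal sketch of consistent hashing follows, assuming a ring of virtual points and SHA-256 as the hash function. The HashRing class, the replicas parameter, and the cache node names are illustrative choices, not the design of any specific caching product.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a 64-bit integer position on the ring."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Assign keys to nodes so that removing a node only moves that node's keys."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas  # virtual points per node, for a more even spread
        self._points = []         # sorted ring positions
        self._owner = {}          # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key):
        if not self._points:
            raise LookupError("the ring has no nodes")
        # Walk clockwise to the first point at or after the key's position,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("item-42"))
```

The point of the ring layout is that taking a node out of service only reassigns the keys that node held. Keys on the remaining nodes keep their placement, so most cached entries stay where they are.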

Typically, the cache will store entries for short periods of time, after which they are erased or updated. If the entries are updated every five minutes, then cached inventory data, for example, may be up to five minutes old and out of date. This delay creates a “window of inconsistency” that can cause problems with customer expectations if the database has different, accurate information.
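The window of inconsistency can be reproduced with a small time-to-live cache. In this sketch the TTLCache class, the loader callback, and the injectable clock are assumptions made for illustration, and a plain dictionary stands in for the database.

```python
import time


class TTLCache:
    """Cache values for ttl_seconds, then reload them from the source of truth."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # function that reads the real value
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the example can control time
        self._entries = {}         # key -> (value, time it was loaded)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]        # still fresh enough, may be stale
        value = self._loader(key)
        self._entries[key] = (value, now)
        return value


database = {"widget": 10}          # stand-in for the real database
fake_now = [0.0]
cache = TTLCache(database.__getitem__, ttl_seconds=300, clock=lambda: fake_now[0])

cache.get("widget")                # loads the stock level into the cache
database["widget"] = 7             # three units are sold
fake_now[0] = 120.0
print(cache.get("widget"), database["widget"])  # cache and database disagree
fake_now[0] = 300.0
print(cache.get("widget"), database["widget"])  # entry expired, values agree
```

Between the sale and the expiry of the entry, the cache keeps answering with the old stock level while the database holds the correct one.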

Improving the Consistency of Caches

Striim, a cloud and platform provider, has developed a tool for resolving this window of inconsistency. It is called the Hazelcast Striim Hot Cache, and it solves the problem by using streaming data to synchronize and update the cache in real time. As a result, both the cache and the associated application are consistently updated in real time.

Their high-speed messaging layer routes each event (a data update) to the correct node, meaning the node that actually has the data stored locally within that cache. This is done by applying a consistent hashing algorithm to both the messaging layer and the cache layer.

Data Consistency in Analytics

The data accessed for analytics normally comes from a variety of sources, and so it can arrive in several different formats. The number of variations depends on the amount, or volume, of data being collected. When working with data analytics, data consistency is a part of the data integration process.

Data integration platforms provide a way to integrate the data taken from multiple sources and transform it into a single, uniform format. (Data “value” conflicts cannot be corrected with data integration methods.)
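As a small illustration of transforming several source formats into one, the sketch below normalizes date strings to ISO 8601. The three input formats and the names KNOWN_DATE_FORMATS and to_iso_date are assumptions chosen for the example. A real integration platform would handle many more cases.

```python
from datetime import datetime

# Formats assumed to appear in the source systems, tried in this order.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    cleaned = text.strip()
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # not this format, try the next one
    # Unrecognized values are flagged rather than guessed at.
    return None


raw_dates = ["2023-03-04", " 03/04/2023 ", "04.03.2023", "next Tuesday"]
print([to_iso_date(d) for d in raw_dates])
```

Note that this only fixes the format. If two sources disagree about what the date actually is, that is a value conflict, and no amount of reformatting will resolve it.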

Data consistency differs from data integrity. Data integrity focuses on the quality of the data, or its accuracy. It strives to eliminate errors and redundant information, and to fill in missing information. Data consistency acts as one support for data integrity, and focuses on formatting and constant updating of the data.

Data consistency, as a support for data integrity, ensures users of the data share the same view of the data, including changes that were made by the user and changes made by others. Data inconsistency presents variations of the same data in different locations.
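One simple way to put a number on this kind of inconsistency is to compare two copies of the same records. The sketch below is only an illustration: the consistency_report function, the “consistency rate” it computes, and the sample CRM and billing records are assumptions, not a standard metric or any vendor’s method.

```python
def consistency_report(copy_a, copy_b):
    """Compare two {record id: value} copies of what should be the same data."""
    shared = copy_a.keys() & copy_b.keys()
    # Records present in both copies but holding different values.
    mismatched = sorted(k for k in shared if copy_a[k] != copy_b[k])
    # Records present in only one of the two copies.
    missing = sorted(copy_a.keys() ^ copy_b.keys())
    total = len(copy_a.keys() | copy_b.keys())
    matching = len(shared) - len(mismatched)
    rate = matching / total if total else 1.0
    return {"mismatched": mismatched, "missing": missing, "consistency_rate": rate}


crm = {1: "a@example.com", 2: "b@example.com", 3: "c@example.com", 4: "d@example.com"}
billing = {1: "a@example.com", 2: "b@old.example.com", 3: "c@example.com"}

print(consistency_report(crm, billing))
```

Here record 2 differs between the two systems and record 4 exists in only one of them, so only two of the four records are fully consistent.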

Measuring Data Consistency in Analytics

The Boomi platform offers tools for finding consistency problems, measuring them, and correcting them.

The term “data wrangling” is used on the Boomi website to describe the transformation of data into another format, making it accessible for such things as analytics. Developers who make the transformations are called data wranglers.

The Boomi Hub can provide the clean, accurate data that business-critical work depends on. With the Boomi Hub, data integration rules and data enrichment services can be used to trap bad data before it spreads to other systems. Boomi can synchronize business data, improving accuracy, consistency, and completeness.

