What’s the connection between Data Quality and Blockchain? Dan Myers, principal information quality educator at DQMatters and publisher of The Conformed Dimensions of Data Quality, which seeks to align the Data Management community around common definitions, addressed the issue during his Enterprise Data World Conference presentation titled “Improved Data Quality Implications with Blockchain.”
Blockchain should improve Data Quality significantly. It is all about ownership, he noted (whether of money, real estate, or anything else), about the transfer of that ownership, and about how that data is recorded. Ownership is established by identifying the owner(s), identifying the object(s) they own, and mapping owners to objects.
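The three ingredients of ownership described above can be sketched as a small data model. This is only an illustration: the `Owner` and `Asset` classes, the `unmapped_references` helper, and the sample records are hypothetical names and values, not anything Myers presented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Owner:
    owner_id: str
    name: str


@dataclass(frozen=True)
class Asset:
    asset_id: str
    description: str


def unmapped_references(owners, assets, ownership):
    """Return (owner_id, asset_id) pairs that point at an unknown owner or asset."""
    owner_ids = {o.owner_id for o in owners}
    asset_ids = {a.asset_id for a in assets}
    return [
        (owner_id, asset_id)
        for owner_id, asset_id in ownership
        if owner_id not in owner_ids or asset_id not in asset_ids
    ]


# Hypothetical sample data: two known owners, one car, two ownership claims.
owners = [Owner("O1", "Alice"), Owner("O2", "Bob")]
assets = [Asset("VIN123", "2015 sedan")]
ownership = [("O1", "VIN123"), ("O3", "VIN123")]  # "O3" is not a known owner

broken = unmapped_references(owners, assets, ownership)
```

Here `broken` holds the one claim that cannot be traced to an identified owner, which is the kind of gap the owner-to-object mapping is meant to expose.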
Blockchain offers advantages, such as full audit trails of transactions and verification of entities, that align with the Data Quality improvements defined in the Conformed Dimensions of Data Quality: completeness, accuracy, consistency, validity, timeliness, currency, integrity, accessibility, lineage, and representation.
Myers drew upon a theoretical example of how Blockchain can improve car-related transactions, in which several steps of a re-envisioned business process map to Data Quality dimensions. In one scenario, the paper title is distributed digitally: Blockchain connects the real world to the digital one with both a printable and a digital title, and Data Quality improves through the representation of ownership and its verification against the Blockchain. In another, a private car sale using Blockchain protects the privacy of home and mailing addresses while validating each entity and its authority to sell, improving the accuracy of the sale price and the integrity of the parties conducting the transaction.
Blockchain Background — Data Quality Dimensions
Operating in a peer-to-peer system of ledgers, Blockchains are stored on network nodes and use software units consisting of algorithms that negotiate the information content of ordered, connected blocks of data, together with cryptography and security technologies. “From an architectural perspective, you roll all these up into a logical solution to achieve and maintain its integrity,” Myers said.
As Blockchain relates to cryptography, for example, public keys manage identities and private keys handle authorization, both of which are important for Data Quality. Transactions between unique individuals identify each party by public key and use signatures that relate to their private keys to document agreement and approval of each transaction. “You conduct transactions and they are enclosed or agreed-upon by the addition of a key that seals the envelope of the transaction,” he said. “The real key here is unique individual identification.”
As it relates to client-server and distributed ledger architecture, a copy of a whole Blockchain is stored in a distributed context on each node. Each participant maintains, calculates, and adds new entries into their own ledger and syncs up with all the others. “Distributed consensus is used to ensure that all nodes come to the same conclusion,” he said.
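The idea that all nodes must come to the same conclusion can be illustrated with a deliberately simplified majority vote over the latest block hash each node reports. This is a sketch under stated assumptions, not how any production Blockchain reaches consensus (real networks use protocols such as proof-of-work or Byzantine fault-tolerant voting); the `agreed_head` function, the `quorum` parameter, and the node names are hypothetical.

```python
from collections import Counter


def agreed_head(node_heads, quorum=0.5):
    """Return the head hash reported by more than `quorum` of nodes, else None.

    node_heads maps a node name to the hash of the last block in its ledger copy.
    """
    if not node_heads:
        return None
    counts = Counter(node_heads.values())
    head, votes = counts.most_common(1)[0]
    return head if votes / len(node_heads) > quorum else None


# Hypothetical network: two nodes have synced the latest block, one has not yet.
node_heads = {"node_a": "abc", "node_b": "abc", "node_c": "def"}
consensus = agreed_head(node_heads)

# With an even split there is no majority, so no conclusion is accepted.
no_consensus = agreed_head({"node_a": "abc", "node_b": "def"})
```

In this toy model, `node_c` is simply behind and would adopt the majority head once it syncs.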
As it relates to integrity — an immutable data structure — he used the analogy of a book, whose pages are sequential. “If you are reading along and read something that refers to another thing, you need the integrity of the book — its sequence — so you read that page,” he said. But if a page is removed, readers know that the integrity has been compromised.
As it relates to integrity and transaction hash values in the data structure, the genesis block creates a hash, or digital fingerprint, of the first block that includes the first piece of data. The second transaction creates a hash of both the second piece of data and the hash from the prior transaction, thereby connecting the two. So, each data block is uniquely identified by its hash, which is highly unlikely to be duplicated. With these unique keys, it’s possible to link or join to other “off-chain” data structures, ensuring referential integrity between the transaction in the chain and additional data stored elsewhere.
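The chaining described above can be sketched in a few lines using SHA-256 from Python's standard library. This is a minimal illustration, assuming each block holds a single string of data; the function names (`block_hash`, `build_chain`, `is_intact`), the sample transactions, and the off-chain sale price are all hypothetical.

```python
import hashlib
import json


def block_hash(data, prev_hash):
    """Hash a block's data together with the previous block's hash."""
    payload = json.dumps({"data": data, "prev_hash": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_chain(items):
    """Build a list of blocks, each linked to its predecessor by hash."""
    chain = []
    prev_hash = ""  # the genesis block has no predecessor
    for data in items:
        current = block_hash(data, prev_hash)
        chain.append({"data": data, "prev_hash": prev_hash, "hash": current})
        prev_hash = current
    return chain


def is_intact(chain):
    """Check that no block was altered, removed, or reordered."""
    prev_hash = ""
    for block in chain:
        if block["prev_hash"] != prev_hash:
            return False
        if block["hash"] != block_hash(block["data"], prev_hash):
            return False
        prev_hash = block["hash"]
    return True


chain = build_chain(["title issued to Alice", "Alice sells car to Bob"])

# Off-chain detail keyed by the block hash (hypothetical sale price).
off_chain = {chain[1]["hash"]: {"sale_price": 8500}}

missing_page = chain[1:]  # drop the genesis block, like tearing out a page
```

Calling `is_intact(chain)` returns `True`, while `is_intact(missing_page)` returns `False` because the remaining block still points at the hash of the removed one. The `off_chain` dictionary shows how a block hash can act as the join key to data kept outside the chain.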
Consistency and integrity are the two dimensions of Data Quality most applicable to Blockchain. “The equivalence or redundancy of distributed data is a measure of the similarity of other sources of data that represent the same concept,” he said, describing consistency. Returning to the car example, the buyer knows that the seller has sold the car, but others won’t until the distributed ledger is synched up and there is agreement among as many people as required to ensure consistency about the transaction. Integrity, as defined in the Conformed Dimensions of Data Quality, is a good fit for Blockchain because it measures the structural or relational quality of data sets.
Questions to Be Answered, Issues to Be Addressed
There are, Myers believed, questions surrounding unique keys linking to off-chain data structures. How, he asked, will it be possible to handle the quality of data stored off-chain, because people will want to pump volumes of data into the Blockchain? “You don’t usually put a lot of data in a Blockchain for performance reasons and distributed ledger implications,” he said. “The CEO may come and say ‘let’s take down the Data Warehouse and just put everything in it in a Blockchain so that it’s all reconciled and there are no Data Quality problems.’ That’s not the way to fix things,” he said.
“How much data you put in the Blockchain is a big architectural question that people are trying to figure out right now,” he said.
There’s also the Data Quality issue that can arise because data integrity is not the same as data accuracy. Poor data might be correctly entered into a system, “but that doesn’t represent the real world accurately,” he noted.
Myers has suggested that businesses consider the following as they move ahead with Blockchain:
- Know what is “owned” and who the vendors are, and know your business’s Data Quality needs.
- Understand components of Blockchain and which you really need.
- Research alternative architectures/methods to see if there are similar gains to be had with less hype and risk/cost.
- Move forward with Blockchain/distributed ledger technology with eyes wide open.
- And determine whether everything has to be on the chain and why.