The Business View of Data and Data Quality: The Six Dimensions of Semantic Quality – Part Two

By Ronald G. Ross.

This is the second and final part of this series on Data Quality and Semantic Quality. You can access the first part here: The Business View of Data and Data Quality: The Six Dimensions of Semantic Quality – Part One

Forming High-Quality Business Communications

Rather than retroactively focusing on data already formed, business people and professionals need proactive measures to form high-quality messages in the first place – again, no matter whether structured data or ‘unstructured data’.

What should the recipients of blind messages expect? They have the right to expect:

High-quality evidence about what the content means.
No need for any significant assumptions, whether unconscious or deliberate, to supplement that evidence.
The content representing exactly the reality the evidence suggests.

What form does evidence available to recipients take?

names including codes[1]
definitions
concept model
business rules
documentation of purpose
scope of the need

The dimensions of Semantic Quality arise directly from these six kinds of evidence, respectively. They provide the context for blind communications. The six dimensions are discussed individually below with examples.

Readable

A readable message is one that is not unintentionally encoded or cryptic; that is, one whose meaning is not obscured by choice of signifiers (names or codes). If a message is encrypted (as security of course usually demands these days) the encryption should be on top of the message, not a by-product of forming the message (data) itself.

Cryptic names and codes are rampant in IT systems; they are encouraged by programming languages, software platforms, and legacy computer tradecraft. Here are some typical examples:

PID-RAD2-TYPE. Who but programmers might know what this field name would represent?!
A coding scheme for the values of a field where ‘0’ stands for ‘no’ and ‘1’ stands for ‘yes’. Why?!
The abbreviation ‘PT’. Without adequate evidence, this abbreviation could stand for many things, including the following:

PT Emp –> Part-time employee
PTCRSR –> PT Cruiser (Personal Transportation Cruiser)
Blk pt chassis –> Black platinum chassis
24pt bk –> Manual published in 24-point type
2 pt asbl –> Two-part assembly
1 pt –> One pint
LIS PT –> Lisbon, Portugal

Making names and codes readable is not always easy. For example, concepts that are highly computed or derived are often difficult to label succinctly with names or codes of reasonable size. Supplemental evidence is even more necessary in those cases. For the basic or elemental concepts of a problem domain, however, there is simply no excuse for cryptic names or codes.
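
To make the coding-scheme example concrete, here is a minimal Python sketch of giving a yes/no concept readable names and translating cryptic stored codes where they enter the system. The names YesNo and decode_legacy_flag are hypothetical, and the sketch assumes the legacy scheme is exactly ‘1’ for yes and ‘0’ for no.

```python
from enum import Enum


class YesNo(Enum):
    """Readable signifiers for a yes/no concept (hypothetical example)."""
    YES = "yes"
    NO = "no"


# Assumed legacy coding scheme: '1' stands for yes, '0' stands for no.
_LEGACY_CODES = {"1": YesNo.YES, "0": YesNo.NO}


def decode_legacy_flag(raw: str) -> YesNo:
    """Translate a cryptic legacy code into a readable value.

    Raises ValueError for anything outside the assumed scheme rather
    than guessing at what the sender meant.
    """
    try:
        return _LEGACY_CODES[raw.strip()]
    except KeyError:
        raise ValueError(f"unrecognized yes/no code: {raw!r}") from None
```

Rejecting unknown codes outright, instead of defaulting to ‘no’, keeps the recipient from having to make assumptions about the content.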

Understandable

An understandable message uses only standard terms that have solid business (not data) definitions. Failings in this regard can arise from:

Naming things wrong or obscurely – a term is used for a concept that could easily or subtly be misconstrued.
Defining things poorly or inaccurately – a term’s definition is absent, unclear, imprecise, incomplete, and/or un-business-like.

Naming and defining things – that is, creating a solid business vocabulary – is the fundamental purpose of a concept model. A good concept model provides adequate vocabulary to support discriminating messages, indicating precisely the right word(s) to use for a given concept.

For example:

Suppose someone calls something a ‘site’. The subject matter has to do with immunology. Does ‘site’ refer to a location where a vaccination took place (e.g., a doctor’s office), or to an anatomical location where a vaccination was injected? A good concept model would provide terminology to clearly distinguish the two concepts.
Suppose someone calls something a ‘loss’. This designation could mean either the event of a loss (e.g., a house burns down) or the amount of the loss (e.g., the house was a 50% loss). A good concept model would avoid this ambiguity, perhaps by offering two distinct terms, ‘loss event’ and ‘loss amount’.[2]
Suppose someone says ‘vaccination’. This term could mean either a whole vaccination series (if a particular vaccination requires more than one dose) or an injection of any one dose in a particular vaccination series. A good concept model will provide distinct vocabulary to talk about both meanings, as needed.
Suppose someone says ‘person has vehicle’. This verbal connection could mean any of the following: ‘person owns vehicle’, ‘person leases vehicle’, ‘person borrows vehicle’, or ‘person steals vehicle’. Assuming the need, a good concept model will provide distinct wordings for each of these meanings.
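
As a rough illustration of the last example, the Python sketch below replaces the vague connection ‘has’ with the four distinct wordings listed above. It is only a sketch: VehicleConnection and PersonVehicleFact are hypothetical names, and a real concept model is a business artifact, not code.

```python
from dataclasses import dataclass
from enum import Enum


class VehicleConnection(Enum):
    """Distinct verb wordings in place of the ambiguous 'has'."""
    OWNS = "owns"
    LEASES = "leases"
    BORROWS = "borrows"
    STEALS = "steals"


@dataclass(frozen=True)
class PersonVehicleFact:
    """One stated connection between a person and a vehicle."""
    person: str
    connection: VehicleConnection
    vehicle: str

    def wording(self) -> str:
        # Render the fact using the precise verb, never a generic 'has'.
        return f"{self.person} {self.connection.value} {self.vehicle}"
```

Because the connection must be one of the four named members, a message cannot be formed with an unqualified ‘has’.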

Precise

A precise message is one that uses terms and wordings from a concept model correctly.

In subject matter of any complexity – which is to say virtually all business subject matter – precision in word choice can make a huge difference in the ultimate effectiveness of a communication. There is simply no word like exactly the right word.

Sometimes the choice of word for some concept in a message is simply wrong. Such usage can be highly misleading. For example:

Using ‘extension’ to mean ‘an offering of a product given to a prospect when the prospect clicks on an ad’, rather than what the concept model defines it to mean, ‘an additional period of time given to a prospect to accept an offer’.[3]
Using ‘borrower’ to refer to a party that completes a loan application. A party becomes a borrower only if their loan application is approved and funded. Given a robust concept model, the error in such usage would be obvious.

A robust concept model addresses the deeper semantics from which misunderstandings and misinterpretations of terminology often spring. Root causes of ambiguity can often be eliminated only by reflecting each concept’s logical connections with other concepts. By addressing these connections, a concept model actually represents more than just a business vocabulary; it represents a structured business vocabulary.

Examples of this kind of problem in creating messages:

Use of a term for a concept where a role name would be more accurate, or vice versa (e.g., ‘party’ vs. ‘applicant’ vs. ‘owner’ vs. ‘leaser’).[4]
Use of a term for a concept where a term for one of the concept’s categories or the concept’s super-category would be more accurate (e.g., ‘limited liability corporation’ vs. ‘corporation’ vs. ‘party’).
Use of a term for a concept where a term for either the whole or a part of the whole would be more accurate (e.g., ‘chassis’ vs. ‘vehicle’).
Use of a term for the class of a thing where a term for the thing itself should be used, or vice versa (e.g., ‘tower’ vs. ‘Eiffel Tower’).

Reliable

A reliable message is one that is compliant with all relevant business rules.

Much confusion arises over business rules. Professionals who work with data/system architectures often have a technical view of them. That’s off-target. Business rules are not data rules or system rules. A true business rule is a criterion for shaping behavior or making decisions in actually running the business. Business rules are about shaping business activity, not data – at least directly.

I recently read the following statement about Data Quality: “Business rules capture accurate data content values.” No. Business rules are about running the business correctly.

If the business is run correctly, then of course its business communications will be formed correctly. If its business communications are formed correctly, then the content of its data/system architecture will also be correct. So yes, business rules result in correct data, but more importantly, correct data arises because business activity is conducted correctly in the first place.

In other words, Data Quality isn’t really about the quality of your data, it’s about the quality of your business rules.

Problems with Data Quality arising from failure to consistently follow appropriate business rules in business activity (or following the wrong rules or following no rules at all) are often illustrated by very simple examples such as the following. Don’t be fooled! These examples barely scratch the surface – they just happen to be relatively easy to talk about.

Data in a field is invalid because it violates some definitional business rule(s) – for example, social security numbers are found in a field for the last name of a person. Reasonable definitional rules would disallow numeric values.
Data in a field is invalid because it violates some minimum or maximum threshold – for example, a number greater than 99 is found in a percentile field.
Numeric data in a computed field fails to comply with some computation rule – for example, social security tax is calculated incorrectly.
Alpha data in a derived field fails to comply with some derivation rule – for example, a valid-candidate-for-insurance flag is set to ‘yes’ although the person has been convicted of a felony involving a motor vehicle.
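
The single-field problems above could be detected with checks along the following lines. This is a hedged Python sketch, not a prescribed implementation: the function names are hypothetical, and the lower bound of 0 for a percentile is an assumption (the example above states only the upper limit of 99).

```python
from typing import List


def check_last_name(value: str) -> List[str]:
    """Definitional rule: a last name must not contain numeric characters."""
    if any(ch.isdigit() for ch in value):
        return ["last name contains digits"]
    return []


def check_percentile(value: int) -> List[str]:
    """Threshold rule: at most 99 (the minimum of 0 is an assumption)."""
    if not 0 <= value <= 99:
        return ["percentile outside 0..99"]
    return []


def check_insurance_flag(flag: str, motor_vehicle_felony: bool) -> List[str]:
    """Derivation rule: the flag may not be 'yes' for a person convicted
    of a felony involving a motor vehicle."""
    if flag == "yes" and motor_vehicle_felony:
        return ["valid-candidate-for-insurance is 'yes' despite felony"]
    return []
```

Each function returns a list of problems rather than a simple pass/fail so that the reasons for a violation can be reported back in business terms.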

Each of the problems above basically addresses values of just a single field. Often, data in one or more fields can collectively violate some business rule(s) so as to represent a conflicting or prohibited business situation. The following examples illustrate. Each example is first expressed as a business rule, then as a corresponding data constraint in a pseudo constraint language. Incidentally, these examples also illustrate the difference between communicating in business terms vs. communicating in data-speak.

Business rule: A customer must have an assigned agent if the customer has placed an order.

Expressed as a corresponding data constraint: To be correct, valid data is required in the assigned-agent field of a customer record if any orders are listed for that customer record.[5]

Business rule: A claim may include at most one of an assigned adjudicator or a litigating lawyer.

Expressed as a corresponding data constraint: To be correct, no data is permitted in the assigned-adjudicator field of a claim record if data appears in the litigating-lawyer field of that record, and vice versa.

Business rule: The payee of a claim payment for a claim must be a party who makes the claim.

Expressed as a corresponding data constraint: To be correct, any data in the payee field of a claim-payment record must indicate a party who is one of the parties listed as having made the claim.

Business rule: A loan application for a subject property that is subject to litigation may be approved only if the income of the applicant is greater than 20% of the estimated value of the property.

Expressed as a corresponding data constraint: To be correct, a loan-application record linked to a subject-property record that is flagged as being subject to litigation must not be flagged as approved if the (numeric) data in the income field of the applicant record is not more than 20% of the (numeric) data in the estimated-value field of the property record.
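
For readers who think in code, the four business rules above might be sketched in Python as follows. The record layouts (Customer, Claim) and all function names are hypothetical; they are chosen only to keep the wording close to the business rules rather than to the data constraints.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Customer:
    assigned_agent: Optional[str] = None
    orders: List[str] = field(default_factory=list)


@dataclass
class Claim:
    claimants: List[str] = field(default_factory=list)
    assigned_adjudicator: Optional[str] = None
    litigating_lawyer: Optional[str] = None


def customer_has_agent_if_ordered(customer: Customer) -> bool:
    """A customer must have an assigned agent if they have placed an order."""
    return not customer.orders or bool(customer.assigned_agent)


def claim_has_at_most_one_handler(claim: Claim) -> bool:
    """A claim may include an adjudicator or a lawyer, but not both."""
    return not (claim.assigned_adjudicator and claim.litigating_lawyer)


def payee_made_the_claim(claim: Claim, payee: str) -> bool:
    """The payee of a claim payment must be a party who makes the claim."""
    return payee in claim.claimants


def loan_may_be_approved(subject_to_litigation: bool,
                         income: float,
                         estimated_value: float) -> bool:
    """Income must exceed 20% of the estimated property value when the
    subject property is subject to litigation."""
    if not subject_to_litigation:
        return True  # this rule places no restriction on other properties
    return income > 0.20 * estimated_value
```

Note that each function reads almost like its business rule. The equivalent data constraints are much harder to follow, which is the difference in register the examples are meant to show.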

Useful

A useful message is one that is fit for business purpose. Assessing fitness for business purpose requires that the purpose of the message, and all others like it, is stated or described explicitly. For example, the purpose might be “to take and fulfill orders for products of a certain kind”. If all other Semantic Quality dimensions are satisfied, the message will be deemed useful for that purpose. For other purposes it is likely to prove less useful – or not useful at all.

In the simplest terms, the purpose of a message is documentation that explains the intended use of the message. That purpose and its exact scope might not be fully evident from the content of the message itself. The purpose is evidence that can be inspected to pin down the exact context in which the message is meaningful or relevant. For example, given the purpose “to take and fulfill orders for products of a certain kind” a message might not be useful for products not of that certain kind.

To be useful, the parties accessing the message need to have agreed to the purpose explicitly.

Actively participating parties (e.g., ones inside the immediate value stream of business activity) should agree to the common purpose in advance. Ideally, this purpose should arise from, and be aligned with, an explicit business strategy for the problem domain.
Passively participating parties (e.g., ones outside the immediate value stream of business activity) should explicitly acknowledge and accept the purpose when they eventually attempt to make use of the message.

Sufficient

A sufficient message is one that can be taken with confidence to satisfy a need. Assessing whether a message satisfies a need requires that the need, like purpose, be agreed explicitly. Unlike purpose, however, the need does not have to be documented explicitly. Instead, the parties must agree that messages of all types falling within scope can convey everything necessary to satisfy the purpose.

In effect, the parties are simply stipulating that the concept model is complete. If some vocabulary needed in some message to satisfy the purpose cannot be found in the concept model, then the need cannot be fully satisfied. There would literally be no words (standard business vocabulary) to talk about it. Therefore, it could not be communicated (at least with any confidence).

For example, suppose the purpose is again “to take and fulfill orders for products of a certain kind”. Someone on the receiving end of a message within scope needs ‘wheel diameter’, but ‘wheel diameter’ cannot be found in the concept model. As a consequence, the message cannot legitimately supply it; the message is therefore not sufficient.
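
The ‘wheel diameter’ example amounts to a simple set comparison, sketched below in Python. The function name missing_vocabulary and the representation of a concept model as a plain set of terms are simplifying assumptions for illustration; a real concept model also carries definitions and connections between concepts.

```python
from typing import Iterable, List


def missing_vocabulary(needed_terms: Iterable[str],
                       concept_model_terms: Iterable[str]) -> List[str]:
    """Return the needed terms that the concept model does not provide.

    An empty result means the vocabulary is sufficient for this need.
    """
    return sorted(set(needed_terms) - set(concept_model_terms))
```

If the result is non-empty, no message within scope can legitimately supply the missing terms.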

The concept model thus plays a sovereign role in determining whether any message within scope can be deemed sufficient.

Summary

The six dimensions of Semantic Quality get to root causes of ‘Data Quality’ problems. Communicating about difficult subject matter is hard to begin with. Blind communication to people you can’t converse or interact with directly is the hardest of all. It requires order-of-magnitude sophistication in the techniques used to form the messages. Concept models and business rules provide the necessary tools.

Footnotes

[1] From a business communications perspective, a code used as a stored value in a field is actually a name for a shared concept about some thing. That thing sometimes, but by no means always, exists in the real world. For example, the real MA (Massachusetts) can’t be put in a file or database (way too large!). The code ‘MA’ is the name for our shared understanding (concept) of that U.S. state as it exists in the real world.

[2] There are many words in English that are used loosely either for some thing, or for a quantity of that thing – words like tax, charge, deduction, service, etc. Such words generally make poor terms without qualification.

[3] Perhaps even worse is being inconsistent in usage – e.g., sometimes the term means one thing, and sometimes another. Such terms are called homonyms (one word or word phrase, but multiple meanings). Synonyms – different words or word phrases standing for the same meaning – also present challenges for effective communication, though generally not as difficult.

[4] Role names are always related to verb concepts – i.e., named connections or characteristics pertaining to noun concepts. For example, ‘applicant’ arises from the verb concept ‘party applies for mortgage’. ‘Owner’ arises from ‘party owns property’. ‘Leaser’ arises from ‘party leases property’.

[5] Refer to Business Rule Concepts: Getting to the Point of Knowledge (4th ed), Ronald G. Ross, 2013, pp. 99-100.
