Click to learn more about author Tejasvi Addagada.
The first Data Quality challenge is most often the acquisition of the right data for enterprise Machine Learning use cases.
Wrong Data – Even though the business objective is clear, data scientists may not be able to find the right data to use as inputs to the ML service/algorithm to achieve the desired outcomes.
As any data scientist will tell you, developing the model is less complex than understanding and approaching the problem/use-case the right way. Identifying appropriate data can be a significant challenge. You must have the “right data.”
So, What Does it Mean to Have the Right Data?
Throughout the rest of this article, we describe the characteristics of the appropriate data for your analytical situation. To start with, you may have identified a set of attributes in your organization’s daily transactions, such as the “channel last used,” that are likely predictors of customer behavior. If your organization isn’t currently collecting these data points, you have a challenge. Having been caught in this situation, many data scientists believe that the more data collected, the better.
Another option, however, is to clearly scope the collection of data required for the use case, based on research into the sensitivities and relationships between existing data attributes. In other words, know your business and its data. By doing this up front, you’ll ensure that the time spent collecting new data isn’t wasted. Knowing this also enables you to define Data Quality rules that ensure data is collected right the first time.
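As an illustration of what a collection-time Data Quality rule might look like, here is a minimal Python sketch. The field names (`customer_id`, `channel_last_used`) and the list of allowed channels are assumptions made up for this example, not part of any standard; in practice they would come from your own business definitions.

```python
# Hypothetical rule set: field names and allowed values are illustrative only.
ALLOWED_CHANNELS = {"branch", "web", "mobile", "phone"}

def validate_transaction(record):
    """Return a list of rule violations for one transaction record (a dict)."""
    problems = []
    # Completeness rule: required attributes must be present and non-empty.
    for field in ("customer_id", "channel_last_used"):
        value = record.get(field)
        if value is None or str(value).strip() == "":
            problems.append(f"missing {field}")
    # Validity rule: a supplied channel must come from the agreed list of values.
    channel = record.get("channel_last_used")
    if channel and channel not in ALLOWED_CHANNELS:
        problems.append(f"invalid channel_last_used: {channel!r}")
    return problems

# Made-up sample records, used only to exercise the rules.
records = [
    {"customer_id": "C001", "channel_last_used": "web"},
    {"customer_id": "C002", "channel_last_used": ""},
    {"customer_id": "C003", "channel_last_used": "fax"},
]
for r in records:
    print(r["customer_id"], validate_transaction(r))
```

Running a check like this at the point of capture, rather than during later cleansing, is one way to catch a missing or invalid attribute while the source can still be corrected.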
Coverage of Data as a Data Quality Dimension for ML Use Cases
In the financial services space, the term Coverage is used to describe whether all of the right data is included. For example, in an investment management firm, there can be different segments of customers as well as different sub-products associated with these customers. Without including all the transactions (rows) describing customers and associated products, your machine learning results may be biased or flat-out misleading. Admittedly, collecting ALL of the data (often from different sub-entities, point-of-sale systems, partners, etc.) can be hard, but it’s critical to your outcome.
More broadly speaking, Coverage can be categorized under the Completeness dimension of Data Quality, where it is called the Record Population concept within the Conformed Dimensions standard. This should be one of the first checks performed before proceeding to other Data Quality checks.
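A Coverage check of this kind can be sketched in a few lines of Python. The sketch below assumes that every customer segment is expected to appear with every product, which will not hold for every firm; the segment and product names are invented for illustration, and in practice the expected combinations would come from your own reference data.

```python
from itertools import product

def coverage_report(transactions, expected_segments, expected_products):
    """Compare the (segment, product) pairs seen in the data with those expected."""
    # Assumption: every segment/product combination should be represented.
    expected = set(product(expected_segments, expected_products))
    observed = {(t["segment"], t["product"]) for t in transactions}
    missing = sorted(expected - observed)
    # Share of expected combinations that have at least one transaction.
    ratio = len(expected & observed) / len(expected) if expected else 1.0
    return ratio, missing

# Made-up sample rows, used only to show the shape of the check.
transactions = [
    {"segment": "retail", "product": "mutual_fund"},
    {"segment": "retail", "product": "etf"},
    {"segment": "institutional", "product": "mutual_fund"},
]
ratio, missing = coverage_report(
    transactions,
    expected_segments=["retail", "institutional"],
    expected_products=["mutual_fund", "etf"],
)
print(f"coverage: {ratio:.0%}")
print("missing combinations:", missing)
```

A missing combination does not always mean data was lost, since the firm may simply not sell that product to that segment, but it flags where to ask the question before training a model.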
With standard resources like the Conformed Dimensions from IQ International, which explain the meaning of the possible Data Quality issues, you are less likely to be bogged down in cleansing work later.
What are the Conformed Dimensions?
They are a standards-based version of the dimensions of Data Quality that you’re already familiar with, such as Completeness, Accuracy, Validity, and Integrity. The IQ International standards offer robust descriptions of the sub-types of Data Quality, called Underlying Concepts of a Dimension, along with example Data Quality metrics for each of these. Using these Data Quality standards at the beginning of your exercise will help ensure your data is fit for your purpose. In the next blog post, we’ll discuss what you can do when you don’t have enough data for the training phase of your project.
Thanks to Dan Myers of IQ International for co-authoring.