Data Drift vs. Concept Drift: What Is the Difference?

*Read more about author Gilad David Maayan.*

Model drift refers to the phenomenon that occurs when the performance of a machine learning model degrades with time. This happens for various reasons, including data distribution changes, changes in the goals or objectives of the model, or changes to the environment in which the model is operating. There are two main types of model drift that can occur: data drift and concept drift.

Data drift refers to a change in the distribution of the data to which the model is applied. Concept drift refers to a change in the relationship between the model’s inputs and the target it predicts, which can happen when the underlying goal or real-world meaning of the target shifts. Both data drift and concept drift can lead to a decline in the performance of a machine learning model.

Model drift can be a significant problem for machine learning systems that are deployed in real-world settings, as it can lead to inaccurate or unreliable predictions or decisions. To address model drift, it is important to constantly monitor the performance of machine learning models over time and take steps to prevent or mitigate it, such as retraining the model on new data or adjusting the model’s parameters. These monitoring and adjustment systems must be an integral part of a software deployment system for ML models.

Concept Drift vs. Data Drift: What Is the Difference?

Data Drift

Data drift, or covariate shift, refers to the phenomenon where the distribution of data inputs that an ML model was trained on differs from the distribution of the data inputs that the model is applied to. This can result in the model becoming less accurate or effective at making predictions or decisions.

A mathematical representation of data drift can be expressed as follows:

P_train(x) ≠ P_live(x), while P(y|x) stays the same

Where P_train(x) is the probability distribution of the input data (x) in the training dataset, P_live(x) is the probability distribution of the input data that the model is applied to, and P(y|x) is the probability of the output (y) given the input, which covariate shift leaves unchanged.

For example, suppose an ML model was trained on a dataset of customer data from a particular retail store, and the model was used to predict whether a customer would make a purchase based on their age, income, and location.

If the input data’s distribution (age, income, and location) for the new data fed to the model differs significantly from the distribution of the input data in the training dataset, this could lead to data drift and result in the model becoming less accurate.
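One common way to check a single numeric input for this kind of shift is a two-sample Kolmogorov-Smirnov test. The sketch below is only an illustration under stated assumptions: the customer ages are synthetic, the names `train_age` and `live_age` are hypothetical, and the threshold uses the standard large-sample approximation for a 5% significance level.

```python
import random
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for value in a + b:
        cdf_a = bisect_right(a, value) / len(a)  # share of a that is <= value
        cdf_b = bisect_right(b, value) / len(b)  # share of b that is <= value
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

def ks_critical_value(n, m, c_alpha=1.358):
    """Large-sample rejection threshold; c_alpha = 1.358 corresponds to alpha = 0.05."""
    return c_alpha * ((n + m) / (n * m)) ** 0.5

# Synthetic data (assumption): training ages centered on 40, live ages centered on 48.
rng = random.Random(0)
train_age = [rng.gauss(40, 10) for _ in range(1000)]
live_age = [rng.gauss(48, 10) for _ in range(1000)]

statistic = ks_statistic(train_age, live_age)
threshold = ks_critical_value(len(train_age), len(live_age))
drift_detected = statistic > threshold
print(f"KS statistic={statistic:.3f}, threshold={threshold:.3f}, drift={drift_detected}")
```

In practice the same check would be run per feature on a schedule, and a library implementation of the test would normally replace the hand-written one.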

Overcoming Data Drift

One way to overcome data drift is to use techniques such as weighting or sampling to adjust for the differences in the data distributions. For example, you might weight the examples in the training dataset to more closely match the input data distribution for the new data that the model will be applied to.
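The weighting idea can be sketched as a density ratio estimated from histograms: each training example is weighted by how much more (or less) common its bucket is in the new data than in the training data. This is a minimal illustration, not a production method; the ages, the bucket width of 10, and the function names are all assumptions made for the example.

```python
from collections import Counter

def bucket(value, width=10):
    """Map a value to the lower edge of its bucket, e.g. 34 -> 30."""
    return int(value // width) * width

def importance_weights(train_values, new_values, width=10):
    """Weight each training example by p_new(bucket) / p_train(bucket)."""
    train_counts = Counter(bucket(v, width) for v in train_values)
    new_counts = Counter(bucket(v, width) for v in new_values)
    n_train, n_new = len(train_values), len(new_values)
    weights = []
    for v in train_values:
        b = bucket(v, width)
        p_train = train_counts[b] / n_train
        p_new = new_counts[b] / n_new  # Counter returns 0 for unseen buckets
        weights.append(p_new / p_train)
    return weights

# Hypothetical ages: the training data skews young, the new data skews older.
train_ages = [22, 25, 27, 31, 34, 38, 45, 52]
new_ages = [41, 44, 47, 49, 53, 55, 58, 33]

weights = importance_weights(train_ages, new_ages)
for age, weight in zip(train_ages, weights):
    print(age, round(weight, 2))
```

The resulting weights could then be supplied as per-example weights when retraining. Note that training examples from buckets absent in the new data receive a weight of zero, and new-data buckets absent from the training set cannot be compensated for by weighting at all.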

Alternatively, you could sample from the new data and the training data to create a balanced dataset for training the model. Another approach is to use domain adaptation techniques, which aim to adapt the model to the new data distribution by learning a mapping between the source domain (the training data) and the target domain (the new data). One way to achieve this is by using synthetic data generation algorithms.

Concept Drift

Concept drift occurs when there is a change in the functional relationship between a model’s input and output data. The model continues to function the same despite the changed context, unaware of the changes. Thus, the patterns it has learned during training are no longer accurate.

Concept drift is also sometimes called class drift or posterior probability shift, because the conditional probability of the output Y given the input X differs between two points in time, t1 and t2:

P_t1(Y|X) ≠ P_t2(Y|X)

This type of drift is caused by external processes or events. For instance, you might have a model that predicts the cost of living based on geographic location, with different regions as input. However, the development level of each region can increase or decrease, changing the cost of living in the real world. Thus, the model loses the ability to make accurate predictions.
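The effect can be illustrated with a toy simulation in which the inputs stay exactly the same and only the rule linking inputs to labels changes. Everything here is hypothetical: the "model" is a fixed threshold standing in for a trained model, and the thresholds of 50 and 70 are arbitrary values chosen for the example.

```python
import random

def model_predict(x):
    """A frozen 'trained' model: predicts the positive class when x exceeds 50."""
    return x > 50

def accuracy(inputs, labels):
    """Share of inputs for which the frozen model agrees with the label."""
    correct = sum(model_predict(x) == y for x, y in zip(inputs, labels))
    return correct / len(inputs)

rng = random.Random(42)
inputs = [rng.uniform(0, 100) for _ in range(10_000)]  # P(x) never changes

labels_t1 = [x > 50 for x in inputs]  # relationship at training time (t1)
labels_t2 = [x > 70 for x in inputs]  # relationship after the drift (t2)

acc_t1 = accuracy(inputs, labels_t1)
acc_t2 = accuracy(inputs, labels_t2)
print(f"accuracy at t1: {acc_t1:.2f}, accuracy at t2: {acc_t2:.2f}")
```

Because the input distribution is identical at both times, a check that only compares input distributions would report nothing here; the drop shows up only when predictions are compared against fresh labels.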

The original meaning of “concept drift” is a change in how we understand specific labels. One example is what we label as “spam” in emails. Patterns such as frequent, mass emails were once considered signs of spam, but this is not always the case today. Spam detectors that still rely on these outdated attributes become less effective at identifying spam because the concept has drifted, and they require retraining.

Here are more examples of concept drift:

The impact of changes to the tax code on a model that predicts tax compliance
The impact of evolving customer behavior on a model that predicts product sales
The impact of a financial crisis on predictions of a company’s profits

Concept Drift vs. Data Drift

With data drift, the decision boundary does not change; only the probability distribution of the inputs, P(x), changes. With concept drift, the decision boundary itself changes because the conditional distribution of the output given the input, P(y|x), changes. The input distribution P(x) may or may not change at the same time.

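Another difference is that data drift is often the result of internal factors, such as changes in data collection or processing, although a shift in the real-world population that the model serves can also change the input distribution. Concept drift typically results from external factors, such as changes in the real-world situation that the model describes.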

Strategies to Detect and Overcome Data and Concept Drift

There are several strategies that can help detect and overcome model drift in a machine learning system:

Performance monitoring: Regularly evaluating the performance of the ML model on a holdout dataset or in production can help to identify any decline in accuracy or other metrics that may indicate model drift.
Data and concept drift detection algorithms: Statistical tests such as the two-sample Kolmogorov-Smirnov test can detect data drift by comparing the distribution of a feature in the training data with its distribution in new data. Change detectors such as the Page-Hinkley test or the ADWIN algorithm are commonly applied to a stream of model errors or other monitored values to detect concept drift. These algorithms can automatically flag changes in the input data or in the task that may indicate model drift.
Data and concept drift prevention techniques: These techniques cannot stop the real world from changing, but they can make a model less sensitive to drift. For example, using data augmentation or synthetic data generation can help to ensure that an ML model has exposure to a wide, representative range of data, which can make it more resilient to shifts in the data distribution. Similarly, using transfer learning or multitask learning can help the model to adapt to a changing task or objective.
Retraining and fine-tuning: If model drift is detected, retraining or fine-tuning the model on new data can help to overcome it. This can be done periodically, or in response to significant changes in the data or task.
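As an illustration of the detection and retraining strategies above, the following is a minimal sketch of the Page-Hinkley test applied to a stream of prediction errors (1 for a wrong prediction, 0 for a correct one). The parameter values `delta`, `threshold`, and `min_samples` are illustrative assumptions that would need tuning on real data, and the error stream is synthetic: the error rate jumps from 10% to 50% at step 500.

```python
class PageHinkley:
    """Page-Hinkley test for detecting an increase in the mean of a stream."""

    def __init__(self, delta=0.05, threshold=20.0, min_samples=30):
        self.delta = delta            # tolerated deviation from the running mean
        self.threshold = threshold    # alarm level for the test statistic
        self.min_samples = min_samples
        self.n = 0
        self.mean = 0.0
        self.cumulative = 0.0         # cumulative sum of deviations
        self.minimum = 0.0            # smallest cumulative sum seen so far

    def update(self, value):
        """Consume one observation; return True when a change is signalled."""
        self.n += 1
        self.mean += (value - self.mean) / self.n
        self.cumulative += value - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        statistic = self.cumulative - self.minimum
        return self.n >= self.min_samples and statistic > self.threshold

# Synthetic error stream (assumption): 1 error in 10, then 1 error in 2.
errors = [1 if i % 10 == 0 else 0 for i in range(500)] + [i % 2 for i in range(500)]

detector = PageHinkley()
detected_at = None
for step, error in enumerate(errors):
    if detector.update(error):
        detected_at = step  # a real system might trigger retraining here
        break

print("change signalled at step:", detected_at)
```

A larger `threshold` reduces false alarms at the cost of slower detection, which is the main trade-off to tune when wiring a detector like this into a retraining pipeline.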

By regularly monitoring for model drift and taking proactive steps to prevent or mitigate it, it is possible to maintain the accuracy and reliability of machine learning models over time.

Conclusion

In conclusion, data drift and concept drift are the two main forms of model drift, and both can affect the performance of machine learning (ML) models.

Data drift, also known as covariate shift, occurs when the distribution of the input data that an ML model was trained on differs from the distribution of the input data that the model is applied to. Concept drift occurs when the relationship between the input data and the output that the model predicts changes over time.

Both data drift and concept drift can lead to the model becoming less accurate or effective at making predictions or decisions, and it is important to understand and address these phenomena in order to maintain the performance of an ML model over time.

There are various techniques that can be used to overcome data drift and concept drift, including retraining the model on updated data, using online learning or adaptive learning, and monitoring the performance of the model over time.
