Making Machine Learning Datasets Unbiased

By Dmitry Pozdnyakov and Olga Ezzheva

Machine Learning (ML), a subset of the broader field of Artificial Intelligence (AI), is finding its way into more and more areas of application. From smarter shopping recommendations to better medical diagnosis to more effective fraud detection, businesses are leaning on ML to inject new efficiencies into their workflows and support their decision making.

As this powerful technology starts to have a greater impact on society, it gives rise to valid concerns about algorithmic bias and transparency. Speaking of bias, tech giant Amazon had to scrap its AI-enabled recruitment system because it favored men over women.

Machine Learning solutions consume massive amounts of data, identify even the slightest correlations, and predict an outcome. But to deliver these results, ML models first need to be trained with a training dataset that will serve as a benchmark. If any bias creeps into the datasets used for training, an ML algorithm will only amplify it further, undermining the integrity of any decision based on its predictions.

In Amazon’s case, the bias came from training the system with 10 years’ worth of resumes submitted to the company. And as should be expected in a male-dominated tech industry, these resumes came mostly from men.

Trained with poor and biased data, a Machine Learning algorithm is unable to deliver an accurate forecast. So how can you remove or minimize bias in your Machine Learning datasets in the first place?

Make Sure Your Datasets are Representative

In 2016, the first beauty contest judged by AI produced controversial results: of the 44 winners, only one had dark skin, some were Asian, and the rest were white. Yet the participants came from 100 different countries, including large groups from India and Africa. The algorithm was not intentionally trained to prefer white people; there simply were not enough minorities in the training data for it to judge human beauty fairly.

Making training datasets representative and balanced is key to a viable ML model that does not yield unintended or even offensive results. Think of all the user groups that your product will serve: are they all adequately represented? By analyzing your training dataset from the perspective of an end user, you may be surprised to find gaps that require collecting additional data.
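One way to look for such gaps is to compare each group's share of the training data with the share you expect among end users. The sketch below is only an illustration: the function name `representation_gaps`, the `tolerance` threshold, and the sample figures are all assumptions made up for this example, not part of any library.

```python
from collections import Counter


def representation_gaps(groups, expected_shares, tolerance=0.05):
    """Return groups whose share in the data falls short of the expected share.

    groups: one group label per training example.
    expected_shares: dict of group label -> expected share (0..1) among end users.
    tolerance: hypothetical threshold; only larger shortfalls are reported.
    """
    groups = list(groups)
    total = len(groups)
    counts = Counter(groups)
    gaps = {}
    for group, expected in expected_shares.items():
        actual = counts.get(group, 0) / total if total else 0.0
        if expected - actual > tolerance:
            gaps[group] = {"expected": expected, "actual": actual}
    return gaps


# Made-up example: group "a" dominates the training data.
training_groups = ["a"] * 80 + ["b"] * 15 + ["c"] * 5
expected = {"a": 0.5, "b": 0.3, "c": 0.2}
gaps = representation_gaps(training_groups, expected)
print(sorted(gaps))  # groups that need more data
```

Here groups "b" and "c" are flagged because each falls well short of its expected share. The expected shares themselves have to come from knowledge of your user base; the code cannot supply them.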

Another technique for handling imbalance in a dataset is resampling. To minimize unwanted distortions, you may add instances from an underrepresented minority class, which is called oversampling, or delete instances from an overrepresented class, which is called undersampling.
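As a rough illustration, random oversampling and undersampling can be sketched with the Python standard library alone. The helper name `resample`, its parameters, and the sample rows are assumptions for this example; it duplicates or drops rows at random, which is the simplest form of resampling rather than the only one.

```python
import random
from collections import Counter, defaultdict


def resample(rows, label_of, mode="over", seed=0):
    """Balance classes by random oversampling ("over") or undersampling ("under").

    rows: list of training examples.
    label_of: function that returns the class label of a row.
    """
    if mode not in ("over", "under"):
        raise ValueError("mode must be 'over' or 'under'")
    if not rows:
        return []
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    sizes = [len(members) for members in by_class.values()]
    # Oversampling grows every class to the largest size;
    # undersampling shrinks every class to the smallest size.
    target = max(sizes) if mode == "over" else min(sizes)
    balanced = []
    for members in by_class.values():
        if len(members) < target:
            # Keep all originals and add random duplicates (with replacement).
            extra = rng.choices(members, k=target - len(members))
            balanced.extend(members + extra)
        else:
            # Keep a random subset (without replacement) of the target size.
            balanced.extend(rng.sample(members, target))
    rng.shuffle(balanced)
    return balanced


# Made-up rows of (feature, label); label 1 is the minority class.
data = [(i, 0) for i in range(90)] + [(i, 1) for i in range(10)]
over = resample(data, label_of=lambda row: row[1], mode="over")
under = resample(data, label_of=lambda row: row[1], mode="under")
print(Counter(label for _, label in over))
print(Counter(label for _, label in under))
```

Oversampling by duplication adds no new information, and undersampling throws data away, so either is best seen as a stopgap alongside collecting more representative data.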

Keep Only Relevant Variables

Sensitive personal attributes like gender and race are known to introduce bias and discrimination into ML algorithms. Amazon’s ML-powered recruitment system, mentioned above, showed gender bias against women.

While controlling for specific input parameters like gender, race, or age is a necessary first step, it is not enough. Predictive ML algorithms can still learn these biases from other variables that are correlated with them. Zip codes, for example, can be related to income and race, and profession to gender. Stripping your training dataset down to only the relevant variables will help reduce potential disparities and result in a fairer prediction.
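A simple way to spot such interrelated variables is to ask how much better you can guess the sensitive attribute once you know the other variable. The sketch below is one possible check, not a standard metric: `proxy_score` and the sample data are assumptions invented for this example.

```python
from collections import Counter, defaultdict


def proxy_score(feature_values, sensitive_values):
    """Return (baseline, informed) accuracy of guessing the sensitive attribute.

    baseline: always guess the most common sensitive value overall.
    informed: guess the most common sensitive value for each feature value.
    A large jump from baseline to informed suggests the feature acts as a proxy.
    """
    pairs = list(zip(feature_values, sensitive_values))
    if not pairs:
        raise ValueError("need at least one example")
    n = len(pairs)
    overall = Counter(sensitive for _, sensitive in pairs)
    baseline = overall.most_common(1)[0][1] / n
    per_value = defaultdict(Counter)
    for feature, sensitive in pairs:
        per_value[feature][sensitive] += 1
    informed = sum(c.most_common(1)[0][1] for c in per_value.values()) / n
    return baseline, informed


# Made-up data: zip code partly reveals group membership.
zip_codes = ["10001"] * 4 + ["20002"] * 4
group = ["x", "x", "x", "y", "y", "y", "y", "x"]
print(proxy_score(zip_codes, group))  # (0.5, 0.75)
```

Keep in mind that a feature with many distinct values will look like a strong proxy on a small sample by chance alone, so a check like this is more trustworthy on held-out data.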

Engage External Experts

Created by humans, Machine Learning algorithms can easily pick up the biases of their creators. An ML model that uses historical data to predict outcomes will inadvertently reinforce any bias found in past decisions, metrics, or parameters. Note that the smaller the group of people responsible for decisions, the higher the risk of bias.

One way to combat these past injustices is to diversify your data science team. People with different backgrounds and life experiences will bring a fresh and even unexpected perspective to the problem at hand, helping to balance out the training dataset and make it more neutral. Some companies even invite outside domain experts to audit the company’s past practices so as not to bake past biases into their Machine Learning algorithms.

Keep Humans in the Loop

It’s a mistake to think that once an ML model is trained and released into the wild, it no longer needs human supervision. An algorithm predicting house prices, for example, will require regular retraining with fresh, up-to-date data, since prices change all the time and predictions will become inaccurate before you know it.

To ensure your Machine Learning algorithm continues to deliver accurate, unbiased outcomes, you need to remain vigilant and keep monitoring your model after launch. By frequently checking your algorithm’s performance against a set of indicators that reflect non-discrimination, you will be able to detect bias early on and correct the ML model by isolating and removing a problematic variable from the training dataset.
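One candidate indicator, chosen here as an assumption rather than taken from any specific system, is the gap in positive-prediction rates between groups (often called demographic parity). The sketch below computes it for a batch of predictions; the function names, the sample data, and the alert threshold of 0.2 are all hypothetical.

```python
from collections import defaultdict


def selection_rates(predictions, groups):
    """Return the share of positive predictions for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for prediction, group in zip(predictions, groups):
        totals[group] += 1
        if prediction:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


def parity_gap(predictions, groups):
    """Return the largest difference in selection rate between any two groups."""
    rates = selection_rates(predictions, groups)
    if not rates:
        raise ValueError("need at least one prediction")
    return max(rates.values()) - min(rates.values())


# Made-up batch of model decisions (1 = selected) logged after launch.
batch_predictions = [1, 1, 1, 0, 1, 0, 0, 0]
batch_groups = ["m", "m", "m", "m", "f", "f", "f", "f"]

gap = parity_gap(batch_predictions, batch_groups)
ALERT_THRESHOLD = 0.2  # hypothetical value; choose one that fits your context
if gap > ALERT_THRESHOLD:
    print(f"Selection-rate gap of {gap:.2f} exceeds the threshold; review the model.")
```

Running a check like this on each new batch of predictions turns monitoring into a routine alert rather than an occasional audit. Which indicator is appropriate depends on the application, and a single number will not capture every form of bias.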

Wrapping up

A powerful technology, Machine Learning lends itself well to our data-driven world and helps businesses turn massive amounts of data into digestible insights. But with Big Data fueling Machine Learning algorithms, bias in data remains ML’s Achilles’ heel.

Bias tends to seep into training datasets through sensitive attributes, interrelated variables, and under- or overrepresented categories. To avoid baking this bias into an ML algorithm, clean your training data, recognize potential distortions, and take measures to eliminate them.
