
A lot of the time, ethical issues in AI systems arise from the most mundane decisions made about data, such as how it is processed and prepared for machine learning (ML) projects. I’ve been reading Designing Machine Learning Systems by Chip Huyen, which is filled with practical advice about design choices in machine learning applications, and it inspired this month’s question …
How do technical data choices in machine learning lead to ethical issues?
What Exactly Is “Learned” from Data?
When I was doing my graduate research about applied AI ethics in healthcare, one of my interview subjects told me a fascinating story about radiology scans. This AI researcher shared that they had entered competitions using various data sets, some of which had been collected from hospitals using Siemens equipment and others collected from hospitals using GE equipment. Their model could make accurate predictions, but not because of the content of the scan. Instead, the model had learned a pattern related to the manufacturer of the equipment. They shared that:
“Data is actually different depending on where it is coming from and it’s not different because of biological reasons but because of the technical differences that happen because of acquisition.” (Regan-Ingram, 2020)
Obviously, this isn’t quite what we have in mind for machine learning models when it comes to making predictions. Yet, these kinds of details about data matter because if we aren’t aware that this can happen, how can we address it when designing machine learning systems? With this in mind, let’s look at a couple of technical data decisions from Huyen’s book that have ethical implications.
Missing Values
Real-world data is messy. How we choose to clean it matters. One of those choices involves what to do about missing values. Do we just delete them and throw out the data altogether? That might be convenient because it is easy to do, but it will skew the sample, perhaps in consequential ways.
Maybe we should try to impute – or estimate – the missing values? Is that possible to do accurately? How do we know which technique works best, and what are the implications of using a particular technique?
We are making an ethical choice regardless of what we do. As Huyen puts it:
“There is no perfect way to handle missing values. With deletion, you risk losing important information or accentuating biases. With imputation, you risk injecting your own bias into and adding noise to your data, or worse, data leakage.” (Huyen, 2022)
Knowing which kind of missing data we’re dealing with is an important first step in deciding what to do.
- Missing not at random (MNAR): Data is missing for reasons related to the value of that data. In other words, there is a reason for that particular data not being disclosed. For example, heavy smokers might be most reluctant to disclose their smoking habits.
- Missing at random (MAR): Data is missing because of another observed variable. For example, gender = female might result in age = none of your business … in other words, missing data.
- Missing completely at random (MCAR): Data is missing for reasons that have nothing to do with any of the variables in the dataset. For example, someone forgot to fill in a value in a survey. It should be noted that according to Huyen, this type of missing data is rare. Usually there is a reason for missing data.
Once we know why the data might be missing, we can determine the best course of action. For example, if a small amount of the data is MCAR, one could delete those rows. But if that data is MNAR, then we might be removing important samples that would be useful in making predictions, because the missing data itself might be part of what is interesting about the sample. Removing rows can also add bias if the data is MAR. Building on our earlier example, if we remove all the rows with missing ages, we would also be disproportionately removing rows where gender = female, skewing the dataset. Removing a column, or feature, instead of the rows might seem like a good idea if a lot of data is missing for that column. However, this has implications for the model as well.
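To make that concrete, here is a minimal pandas sketch (the data is invented for illustration) showing how deleting rows with MAR missingness skews what remains:

```python
import pandas as pd

# Hypothetical survey data mirroring the MAR example above:
# some respondents with gender == "female" decline to give their age.
df = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "female", "male"],
    "age":    [None,     34,       29,     41,     None,     52],
    "smoker": [True,     False,    True,   False,  True,     False],
})

print(df["gender"].value_counts(normalize=True))
# female and male are balanced 50/50 before deletion

# Listwise deletion: drop any row containing a missing value
dropped = df.dropna()
print(dropped["gender"].value_counts(normalize=True))
# female drops to 25% -- the "convenient" choice skewed the sample
```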
Imputing data comes with its own ethical challenges. We won’t do a blow-by-blow analysis of the techniques (there are resources below that go into more detail), but the bottom line is that in trying to address a very common technical issue – missing data – we’re already facing a myriad of possible ethical implications.
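As one small illustration (not an endorsement of any particular technique), here is a sketch using scikit-learn’s SimpleImputer on the toy ages above. Notice how median imputation quietly assumes the missing values look like the observed ones, which is exactly the assumption that fails when the data is MNAR:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The same toy ages as above; None becomes NaN in a float array
ages = np.array([[None], [34], [29], [41], [None], [52]], dtype=float)

# Median imputation: fills the gaps, but every imputed respondent now
# "has" the median age, shrinking the variance and injecting our
# assumption that the missing values resemble the observed ones.
imputer = SimpleImputer(strategy="median")
print(imputer.fit_transform(ages).ravel())
# [37.5 34.  29.  41.  37.5 52. ]
```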
Data Leakage
My earlier story about the radiology data is a story of data leakage. Data leakage in machine learning occurs when some form of the label “leaks” into the features used to make predictions, even though that information is not available when the model is used in the real world. In my graduate research story, the manufacturer of the machines used to gather the data left traces in the data that had material impacts on the model’s predictions. Huyen tells a similar story about COVID data and scans of patients, some of whom were lying down and others upright. The model learned that images of patients lying down correlated with seriously ill patients, leading it to make predictions based on the position of the patient rather than the pertinent medical information. In another case, the font used to label the scans differed between hospitals and became a defining element in the predictions. Seriously – the font mattered!
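To see how easily a proxy can dominate, here is a small, entirely synthetic simulation (the setup and feature names are invented for illustration) in which a model looks accurate while learning nothing but the scanner:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Invented setup: the hospital using scanner type 1 happens to treat
# sicker patients, so the label correlates with the scanner brand,
# not with the scan content (which here is pure noise).
n = 1000
scanner = rng.integers(0, 2, size=n)
label = (rng.random(n) < np.where(scanner == 1, 0.8, 0.2)).astype(int)
scan_content = rng.normal(size=(n, 5))

X = np.column_stack([scanner, scan_content])
X_train, X_test, y_train, y_test = train_test_split(X, label, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print(f"accuracy: {model.score(X_test, y_test):.2f}")
# ~0.80 -- the model looks useful, but it has only learned the scanner
```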
There are numerous causes of data leakage, but one stems from a commonly recommended practice for handling data in machine learning projects. It is standard practice to split the data randomly into training, validation, and test sets. However, if this random split is done on time-correlated data, there is a risk of creating a data leakage issue. Sometimes the correlation with time is obvious, as with stock prices, which tend to move in time-dependent ways. But other times, it’s less obvious:
“Consider the task of predicting whether someone will click on a song recommendation. Whether someone will listen to a song depends not only on their music taste but also on the general music trend that day. If an artist passes away one day, people will be much more likely to listen to that artist.”* (Huyen, 2022)
Huyen’s advice is to incorporate time into the split when dealing with time-correlated data. That level of nuance is often not discussed in general machine learning practice, but missing this detail can unintentionally result in data leakage. It’s also important to note that whatever choices are made, documenting them is essential for traceability and auditability.
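A chronological split is straightforward in practice. Here is a minimal pandas sketch in the spirit of that advice (the column names and cutoff are invented for illustration):

```python
import pandas as pd

# Hypothetical click logs with a timestamp column
logs = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05",
    ]),
    "clicked": [0, 1, 1, 0, 1],
})

# Split chronologically rather than randomly: train on the earlier 80%,
# test on the most recent 20%, so that "tomorrow's" trends cannot leak
# into today's training data.
logs = logs.sort_values("timestamp")
split_idx = int(len(logs) * 0.8)
train, test = logs.iloc[:split_idx], logs.iloc[split_idx:]
```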
These examples come from the early stages of the machine learning pipeline. We haven’t even started to use our data yet, and we’ve already encountered several thorny issues that on the surface appear to be merely technical matters of data preparation. The devil is in the details, and dealing with the imperfections in data is the mundane work of data ethics in the guise of technical work.
More Resources
Flexible Imputation of Missing Data – an ebook that covers imputation in detail
Managing Missing Data in Analytics
* Total aside – I used to run a radio station and I remember those days when a prominent artist passed away and we did back-to-back tribute shows. Our playlist data for those days was incredibly skewed.
Send Me Your Questions!
I would love to hear about your data dilemmas or AI ethics questions and quandaries. You can send me a note at [email protected] or connect with me on LinkedIn. I will keep all inquiries confidential and remove any potentially sensitive information – so please feel free to keep things high level and anonymous as well.
This column is not legal advice. The information provided is strictly for educational purposes. AI and data regulation is an evolving area and anyone with specific questions should seek advice from a legal professional.