One of the questions I’m often asked is how we can use AI ethically. I think this question can’t really be answered without attending to more fundamental questions about how AI is made. We need to ask …
Is It Possible to Ethically Use a Product That Was Made Unethically?
There are a number of data issues central to the ethical production of AI. Until we address them, it feels premature to ask about our own ethical use. So, let’s take a look at those data questions.
AI’s Blood Diamond Problem
A few years ago I gave a talk at Google DevFest outlining what I called AI’s blood diamond problem, named after the 2006 film Blood Diamond, starring Leonardo DiCaprio*. That popular film raised the profile of the unethical supply chain behind diamond mining and showed how these diamonds fuelled conflict and enriched warlords. It provided a graphic display of the ways in which Western consumers, who just wanted a shiny engagement ring, are entangled in bigger geopolitical and economic systems. Once aware of these issues, many people did not want diamonds that contributed to horrific human rights abuses. The industry responded with ethically sourced diamonds, which meant not just conflict-free status but also fair pay, safe working conditions, and environmentally responsible practices. However, it hasn’t been a Hollywood ending for the diamond business. Demand for diamonds has recently hit a low point, due in part to the industry’s unethical history and Gen Z’s prioritization of ethical consumer behaviour. Diamonds might have been known as a girl’s best friend – as part of De Beers’ highly successful marketing ploy – but priorities change.
Large language models are not the entirety of AI, but they tend to be the focus of commercial AI (at least at this point in time). Building an LLM requires massive amounts of training data. That data is often copyrighted material, scraped from the internet and used without consent or compensation. Additionally, the data processing needed to prepare data for AI training involves human rights atrocities too. Data workers are often poorly paid and subjected to all kinds of traumatic imagery in an attempt to keep ‘the worst of the worst’ out of the training dataset. These large language models also have large carbon footprints, raising environmental concerns about both their training and their usage.
In this manner, LLMs are much like blood diamonds: entangled in a web of unethical, possibly illegal practices that have generated lucrative profits for those willing to engage in this form of production. If we apply the same principles to the ethical development of AI as we do to diamond mining, then we must contend with the question of whether it is legally and ethically permissible to use copyrighted data to build these models in the first place.
Paying for Data Wasn’t Part of the Plan
There’s a massive amount of investment being made in data centers and the (primarily Nvidia) chips necessary to process data. Yet paying for the actual data itself was never part of the business model. This was noted back in 2024 by VC firm Andreessen Horowitz, which claimed investors would ‘go broke’ if forced to pay, and by OpenAI, which made the case that while copyrighted data was necessary for AI training, it was abiding by fair use, moving forward under the assumption it would not need to pay:
“Because copyright today covers virtually every sort of human expression — including blogposts, photographs, forum posts, scraps of software code, and government documents — it would be impossible to train today’s leading AI models without using copyrighted materials,” OpenAI argued in its submission to the House of Lords. … They’re already operating under the assumption that they will not pay for things such as training materials, licenses or artists’ labor. (LA Times)
The AI data economy has been predicated on the idea that data can be taken – reused and repurposed – for free. This practice was normalized under the auspices of AI research and has since segued into the commercial realm. Take ImageNet, the dataset that fundamentally changed the way AI was developed, moving the field from a focus on better algorithms to a focus on bigger data. ImageNet was built from images scraped from the internet:
“We collect candidate images from the Internet by querying several image search engines. For each synset, the queries are the set of WordNet synonyms. Search engines typically limit the number of images retrievable (in the order of a few hundred to a thousand). To obtain as many images as possible, we expand the query set by appending the queries with the word from parent synsets, if the same word appears in the gloss of the target synset. For example, when querying “whippet,” according to WordNet’s gloss a “small slender dog of greyhound type developed in England”, we also use “whippet dog” and “whippet greyhound”.” (Deng et al., 2009)
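To make the mechanics of that passage concrete, here is a minimal sketch of the query-expansion step Deng et al. describe. The WordNet synonyms, parent words, and gloss are hard-coded stand-ins here; the real ImageNet pipeline pulled them from the WordNet database.

```python
# Hypothetical sketch of ImageNet-style query expansion (Deng et al., 2009).
# WordNet data is mocked with literals; only the expansion logic is shown.

def expand_queries(synonyms, parent_words, gloss):
    """Start with the synset's synonyms, then append 'synonym + parent word'
    queries whenever the parent word appears in the target synset's gloss."""
    queries = list(synonyms)
    gloss_words = set(gloss.lower().split())
    for syn in synonyms:
        for parent in parent_words:
            if parent.lower() in gloss_words:
                queries.append(f"{syn} {parent}")
    return queries

# The paper's example: "whippet", glossed as
# "small slender dog of greyhound type developed in England"
queries = expand_queries(
    synonyms=["whippet"],
    parent_words=["dog", "greyhound"],
    gloss="small slender dog of greyhound type developed in England",
)
print(queries)  # ['whippet', 'whippet dog', 'whippet greyhound']
```

Each expanded query was then fed to image search engines to harvest candidate images – simple string manipulation sitting at the start of a massive collection pipeline.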
Web scraping is standard practice for commercial AI development, not just academic research. It’s a sought-after job skill for machine learning and software engineering roles at various AI companies. A quick search for ‘web scraping’ on Indeed illustrates this point. Here is a sample of mentions in various postings:
- “Proficient in Python for development of back-end services & APIs, data processing pipelines, browser automation, web scraping”
- “Experience with web scraping frameworks (e.g., Scrapy, Selenium, BeautifulSoup) and API integrations.”
- “Other Skills/Experience: Web Scraping”
- “Proven ability to build and maintain web scrapers, handling challenges like dynamic content, CAPTCHAs, and bot protection measures.”
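For readers unfamiliar with what those postings are asking for, here is a minimal, standard-library-only sketch of the core scraping operation: pulling image URLs out of a page’s HTML. The HTML string is a stand-in; production scrapers use frameworks like Scrapy, Selenium, or BeautifulSoup and fetch live pages at scale.

```python
# Minimal sketch of the scraping step the job postings describe:
# extracting <img> sources from HTML. Uses only the standard library;
# the page content is a hard-coded stand-in for a fetched web page.

from html.parser import HTMLParser

class ImageSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag encountered."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src":
                    self.sources.append(value)

page = '<html><body><img src="/a.jpg"><p>text</p><img src="/b.png"></body></html>'
parser = ImageSrcParser()
parser.feed(page)
print(parser.sources)  # ['/a.jpg', '/b.png']
```

The point is how little code the basic harvesting step takes – the engineering effort in those postings goes into scale and into defeating the CAPTCHAs and bot protections that sites put up to resist it.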
ImageNet also benefited from digital cameras and the proliferation of social media, which enabled data sharing online. Human workers (hired through Amazon Mechanical Turk) did the data-cleaning work needed to verify the images and make the dataset more usable for machine learning. These same practices have made their way from academia into the commercial world of AI training. It’s only because of recent legal challenges that some of the players have decided to settle lawsuits or strike licensing deals. Yet, as these current job descriptions indicate, there is still an awful lot of web scraping taking place.
The economics of the AI industry rely on the data being freely available for use. This means that those producing the data – authors, illustrators, artists, and other content creators – are not being compensated. In an interview with CBC’s As it Happens, Mary Rasenberger, CEO of the Authors Guild said:
“There’s a general feeling that this is unfair … The AI developers created these systems behind our backs without asking permission.” (CBC)
The whole system is built on this data economy. It’s reminiscent of the extractive, colonial history that underpins the diamond mining industry – a take-whatever-you-can attitude.
Back to Our Question
Circling back to the question of ethical use and ethical production, one way to assess the situation is to get much more specific. What AI model or tool are we talking about? If we are talking about a narrowly scoped system where the data has been lawfully and ethically acquired, and there was fair compensation for all of the relevant parties involved, then we can move on to talk about using the tool. But if we don’t have those conditions in place – which is true of most of the commercially available generative AI tools – then we need to stop and ask ourselves whether we are on board with using these tools at all, given their provenance. We need to grapple with the Blood in the Machine – as much as we did with the issue of blood diamonds.
* Interesting aside: In the film, DiCaprio plays South African soldier of fortune/smuggler Danny Archer. The field of AI has numerous South African connections, from Elon Musk to Peter Thiel to David Sacks; this article unpacks that angle in greater detail.
Send Me Your Questions!
I would love to hear about your data dilemmas or AI ethics questions and quandaries. You can send me a note at [email protected] or connect with me on LinkedIn. I will keep all inquiries confidential and remove any potentially sensitive information – so please feel free to keep things high level and anonymous as well.
This column is not legal advice. The information provided is strictly for educational purposes. AI and data regulation is an evolving area and anyone with specific questions should seek advice from a legal professional.
Data and AI Ethics Courses
Explore the ethical considerations and standards implicit in the data industry and the emerging realm of AI.
(Use code DATAEDU for 25% off!)

