Data Audit in the Age of Machine Learning: Goals and Challenges

*Read more about author Andrea Di Stefano.*

Despite its many benefits, the emergence of high-performance machine learning systems for augmented analytics over the last 10 years has led to a growing “plug-and-play” analytical culture, where high volumes of opaque data are thrown arbitrarily at an algorithm until it yields useful business intelligence. What does this mean in terms of data audit? Let’s discuss it.

Data Audit and the Black Box Problem

Because a typical machine learning workflow is a black box, it can be difficult to understand or account for the extent of the “dark” data that survives these processes, or the extent to which the unacknowledged provenance or unexplored scope of the data sources could expose a downstream application to legal liability later on.

This raises several questions:

What are the implications of machine learning’s opaque nature for data audit?
Did the data pass through jurisdictions that encumber the enterprise with legal obligations in regard to its storage?
Are the data’s evolving schema and origins sufficiently well-understood to placate the concerns of partners, or to satisfy the “due diligence” phase of a buy-out?
Is its opacity a potentially fatal liability in the face of coming regulatory standards that did not exist when the data was first introduced?

Here, we’ll look at possible answers to some of these questions, while clarifying the reasons behind data audit and defining some guidelines to deal with data audit in the field of AI and machine learning.

Goals of Data Audit

In most jurisdictions, a data audit is not currently an official and prescribed event. Rather, it’s a process that may involve varying standards of transparency and disclosure.

Though the objectives for a data audit may vary depending on whether the audit is being conducted for compliance (external demands) or performance (internal, commercial review of processes), either type of audit is a worthwhile opportunity to tune your data-gathering and governance procedures and policies, and to take both sets of needs into consideration.

Therefore, some of the objectives of a data audit may include:

The use of untapped data resources to develop new processes
The reduction of a company’s storage burden by identifying non-actionable and legally irrelevant data
Compliance with regulations (such as privacy policies) and license terms (including “fair use” clauses), thereby preventing legal liabilities
The identification of non-indexed material, with a view to developing a forward plan for it (such as deletion, governance requirement evaluation, or general indexing)
The detection and removal of malicious data while securing the channels and protocols that allowed it in
The establishment of workflows for automatically handling data anomalies in future audits (for example, if non-compliant or inadequately tagged data triggers a manual alert)
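The last objective, routing non-compliant or inadequately tagged data to a manual alert, can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed workflow: the required tag names, the license allow-list, and the record layout are all assumptions made up for the example.

```python
from dataclasses import dataclass, field

# Hypothetical policy values, chosen only for illustration.
REQUIRED_TAGS = {"source", "license", "jurisdiction"}
ALLOWED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "proprietary-internal"}


@dataclass
class AuditFinding:
    record_id: str
    problems: list = field(default_factory=list)


def audit_records(records):
    """Return one AuditFinding per record that needs manual review."""
    findings = []
    for record in records:
        problems = []
        tags = record.get("tags", {})

        # Inadequately tagged: one or more required tags are absent.
        missing = sorted(REQUIRED_TAGS - set(tags))
        if missing:
            problems.append("missing tags: " + ", ".join(missing))

        # Non-compliant: a license is present but not on the allow-list.
        license_id = tags.get("license")
        if license_id is not None and license_id not in ALLOWED_LICENSES:
            problems.append("license not on allow-list: " + license_id)

        if problems:
            findings.append(AuditFinding(record["id"], problems))
    return findings
```

In practice, the returned findings would feed whatever alerting channel the organization already uses; records that pass both checks produce no finding and need no human attention.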

Shedding Light on Source Data

By nature, machine learning algorithms absorb and obscure their data sources (datasets), defining desired features to be extracted from a dataset and generalizing those features in the latent space of the training process. The resulting models are therefore abstract representations, and are generally considered incapable of explicitly exposing their contributing source data.

However, reliance on this automatic obscurity is increasingly coming under challenge from recent methods to expose source data from algorithmic output, such as model inversion.

The Role of Model Inversion

Model inversion techniques are proving capable of disclosing confidential information that was supposed to be protected by the way machine learning models “abstract” source data. The term covers a variety of techniques that make it possible to poll an AI system and piece together a picture of the contributing data from its responses to different queries.

This includes uncovering the “weights” of a model, which often represent the intrinsic value of a machine learning framework. Indeed, if the weights were derived from material that later becomes IP-locked, and model inversion can expose that use of copyrighted data, it will not matter that the current dataset is impeccable from a governance standpoint.
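To make the idea of polling a system and piecing together a picture from its responses more concrete, here is a deliberately tiny Python sketch. It is a toy under stated assumptions, not a real attack: the “secret” model is a made-up four-weight logistic scorer, the attacker sees only a confidence score per query, and a simple hill-climb stands in for the far more sophisticated optimization used in published work. The function names and all numeric values are hypothetical.

```python
import math
import random

# Hypothetical "secret" model: a logistic scorer whose weights the attacker
# never sees directly. The numbers are invented for this example.
_SECRET_WEIGHTS = [2.0, -1.0, 0.5, 3.0]


def query_model(x):
    """Black-box API: returns only a confidence score between 0 and 1."""
    z = sum(w * xi for w, xi in zip(_SECRET_WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-z))


def normalize(x):
    """Scale a vector to unit length so scores stay comparable."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return [xi / norm for xi in x]


def invert(query, dim, steps=2000, step_size=0.1, seed=0):
    """Hill-climb on query responses to find the unit-length input
    that the model is most confident about."""
    rng = random.Random(seed)
    best = normalize([rng.gauss(0, 1) for _ in range(dim)])
    best_score = query(best)
    for _ in range(steps):
        # Propose a small random change and keep it only if confidence rises.
        candidate = normalize([xi + rng.gauss(0, step_size) for xi in best])
        score = query(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)


recovered = invert(query_model, dim=4)
similarity = cosine(recovered, _SECRET_WEIGHTS)
```

In this linear toy case, the input the model is most confident about happens to point in the same direction as the hidden weight vector, so `similarity` ends up close to 1 even though the attacker only ever called `query_model`. Real models are far less transparent, but the principle of reconstructing hidden information from query responses is the same.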

Three Data Audit Scenarios

Considering the above, auditing your data assets for compliance, in reasonable anticipation of third-party audits at a later date, becomes an absolute priority. In this regard, let’s examine three relevant scenarios:

FOSS datasets: If your analytics system has used a free and open-source (FOSS) dataset and the license later becomes more restrictive, any software (including machine learning algorithms) unwittingly developed with the now IP-locked data will be subject to restrictions as well. So, you should always assess the long-term viability of the license and the data. Another risk to consider is a FOSS dataset whose provenance and IP integrity are later challenged by third parties that lay claim to the data.
Synthetic datasets: Synthetic data, such as artificially produced text or CGI-generated imagery, is an increasingly popular approach to data generation. If you did not create a synthetic dataset yourself, you should also be aware of the provenance of the information in it. Are all its contributing data sources publicly disclosed and available for inspection? Can you follow its entire chain of creation back to the first source and be satisfied with the validity and perpetuity of the license terms?
Proprietary datasets: Generating your own dataset is the safest way to develop unassailable source data, but also the most expensive and time-consuming. That’s why several companies take advantage of currently lax regulations around data scraping and exploit online material whose host domain might prohibit such use. However, regulations may tighten in the future, leading to disputes that are settled in court. Therefore, it’s prudent to anticipate this when designing long-term data extraction, custody, and governance policies.
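The question raised for synthetic datasets, whether you can follow the entire chain of creation back to the first source, applies to all three scenarios and lends itself to a simple check. The Python sketch below is one possible way to model it, assuming that each dataset record lists the datasets it was derived from and that the chain contains no cycles. The `Dataset` class, its fields, and the function name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dataset:
    name: str
    license: Optional[str]  # None means the license is undisclosed
    sources: List["Dataset"] = field(default_factory=list)


def undisclosed_ancestors(dataset, _path=None):
    """Walk the chain of creation and list every path that ends in a
    dataset with no disclosed license. Assumes the chain has no cycles."""
    path = (_path or []) + [dataset.name]
    problems = []
    if dataset.license is None:
        # Record the full path so an auditor can see how the data got here.
        problems.append(" <- ".join(path))
    for source in dataset.sources:
        problems.extend(undisclosed_ancestors(source, path))
    return problems
```

An empty result means every dataset in the chain declares a license. It says nothing about whether those licenses are valid or perpetual, which still calls for human and legal review.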

A Catalyst for Data Audit

At present, interest in model inversion is being fueled by a growing campaign around data privacy and AI security.

Indeed, the history of patent trolling over the last 30 years suggests that researchers’ free ride on public data will come to the attention of copyright enforcers over the next 10 years as national AI policies mature, and that growing data transparency requirements will coincide with the capabilities of model inversion to expose data sources.
