The Fundamentals of Data Anonymization and Protection

By David Balaban

Data anonymization is a process aimed at eliminating personally identifiable clues from data so that it’s impracticable, or at least very challenging, to attribute the information to a specific individual. In this regard, the terms “de-identification” and “anonymization” denote virtually the same concept and may be used interchangeably.

The purpose of anonymizing data mainly revolves around privacy protection. It is common practice to leverage this mechanism in medical research, market analytics, and many other domains where data shouldn’t be easily traced back to a person. Although the legal side of the matter varies by jurisdiction, processing information that can be tied to an individual is mostly restricted by law unless the person knowingly agrees to it. Once data is properly anonymized, though, the entities that have access to it can typically use the records with far fewer legal strings attached.

The Whys and Wherefores of Anonymizing Data

The booming Internet of Things (IoT) has taken data harvesting activities to the next level. Connected devices amass huge amounts of information about pretty much everyone, and the bulk of it resides online. The mantra about non-disclosure of this data could be cold comfort for two reasons. First off, there is a risk of employees abusing their privileges to mishandle customer information. Secondly, cybercrooks are ramping up their efforts to obtain sensitive data via breaches that are hitting the headlines over and over.

Nowadays, pretty much everything you do leaves a data footprint, and this goes beyond online activities alone. When you are visiting a doctor, interacting with businesses in-store and online, surfing the web, or using desktop and mobile applications, there are “breadcrumbs” of these events spilled in the form of data. If you follow the technology trends and live in a smart home, then the internet-enabled devices around you know and retain information about your day-to-day routine, from the time of your coffee breaks to your family’s power consumption patterns.

Things are hassle-free as long as all of this data is treated in isolation from a person. However, this is an ideal scenario that hardly holds once intelligent gadgets and internet services kick in. Common types of collected data include your email address, IP address, name, and geolocation. In many cases, the range of personally identifiable information (PII) collected behind your back is broader than that.

Data Protection Best Practices

One of the most effective ways to safeguard your online routine from prying eyes is to use a Virtual Private Network (VPN) solution that encrypts the traffic as it travels between your device and the VPN server. This approach makes it much harder for ISPs, data aggregators, and cybercrooks to harvest your data in transit, which means you can surf the web, use torrent clients, and run online-facing applications with more peace of mind.

The truth is, most people don’t bother leveraging techniques like that. Even those who are privacy-minded and follow proper online hygiene may fail to keep their data intact in some situations. A leak of one’s medical records after a hospital visit is an example of how personal vigilance can turn out to be futile because the information is mistreated by a third party.

In this context, de-identification is a game-changer because it sets PII apart from other data. With the scourge of malware outbreaks and large-scale data breaches haunting different industries over and over, this approach makes a whole lot of sense.

Organizations may be officially required to conduct data anonymization in some cases. It could also be a mere recommendation, and companies can choose whether or not to take this route. There are also cases where anonymization is a pivot point of a business marketing strategy and a stimulus for people to entrust their sensitive data to a service without a second thought.

GDPR Paving the Way for a Privacy-First Paradigm

The overarching General Data Protection Regulation (GDPR) took effect in the European Union on May 25, 2018. Its primary objective is to reinforce the privacy rights of individuals in the EU. The most significant provisions of this regulation cover prompt data breach reporting (generally within 72 hours of discovery), extensive consumer rights regarding the collection of and access to data, and rigid data security requirements for organizations.

Data anonymization, or at least its subtype known as pseudonymization, which will be described further below, is one of the key recommendations listed in the GDPR. It’s noteworthy that the regulation broadens the concept of PII, complementing the conventional name, physical address, and phone number with online identifiers such as IP addresses and cookie identifiers. Furthermore, compliance with the GDPR is mandatory for any non-EU company that handles the personal data of individuals in the EU.

Other privacy laws comparable to the GDPR include the Canadian Personal Information Protection and Electronic Documents Act (PIPEDA) and Australia’s Privacy Act 1988, which sets out the Australian Privacy Principles. There is no nationwide analog in the United States. The California Consumer Privacy Act (CCPA) bears a resemblance to the GDPR in some ways, but it’s a state-level regulation rather than federal law. That being said, there are sector-specific laws in effect that aim to guard the privacy of US citizens in different areas. The Health Insurance Portability and Accountability Act (HIPAA) fits the mold of this legislation, serving as a privacy roadmap that pertains to medical information.

The Logic of Data Anonymization

There are plenty of techniques to mask data. Therefore, companies don’t have to reinvent the wheel when trying to give their data privacy practices a boost. The caveat is that some of these mechanisms have limited effectiveness. All in all, the most common methods are as follows:

  • Aggregation: Data is accumulated in a de-personalized form. For instance, people’s ages may not be logged as such, but the total number of individuals of a specific age is known. This is what the personal data sold by businesses often looks like.
  • Encryption: Whereas this approach doesn’t eliminate personal details from a set of data, it renders the information unreadable without the decryption key. This way, it cannot be abused even if it ends up in the wrong hands.
  • Generalization: Precise values are replaced with broader ones. Examples of this technique are the use of age ranges rather than specific ages and the reduction of phone numbers to their area codes.
  • Hashing: As is the case with encryption, this one doesn’t involve the removal of fingerprintable fragments of data. Instead, it substitutes them with random-looking hash strings.
  • Pseudonymization: The idea behind this method is to replace personally identifiable elements of a data entry with artificial values known as pseudonyms. Whereas anonymization is meant to be irreversible, pseudonymization allows for re-identifying the cloaked data at a later point using additional information that is kept separately.
  • Perturbation: This one comes down to slightly altering PII values. It isn’t suitable for scenarios requiring accurate data, though.
  • Randomization: Personal data fields are deliberately skewed or substituted with random attributes.
  • Suppression: This technique “strips” data by completely obliterating personally identifiable values from it.
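
To make several of the techniques above more tangible, here is a minimal Python sketch. It is an illustration under stated assumptions rather than a production recipe: the sample records, the field names, and the helper functions (generalize_age, area_code, suppress, pseudonymize, perturb_age, aggregate_by_age_range, anonymize) are all hypothetical, and the keyed hash (HMAC) is just one possible way to derive pseudonyms.

```python
import hashlib
import hmac
import random
from collections import Counter

# Hypothetical sample records; every name and value here is invented.
RECORDS = [
    {"name": "Alice Smith", "age": 34, "phone": "212-555-0143"},
    {"name": "Bob Jones", "age": 37, "phone": "415-555-0199"},
    {"name": "Carol White", "age": 52, "phone": "212-555-0110"},
]


def generalize_age(age, width=10):
    """Generalization: replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def area_code(phone):
    """Generalization: keep only the area code of an 'NNN-NNN-NNNN' number."""
    return phone.split("-")[0]


def suppress(record, fields):
    """Suppression: drop the listed identifying fields entirely."""
    return {k: v for k, v in record.items() if k not in fields}


def pseudonymize(value, secret_key):
    """Pseudonymization: derive a stable pseudonym with a keyed hash.

    Assumption: the key is stored separately from the data, so only its
    holder can recompute a pseudonym and link it back to a known value.
    """
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def perturb_age(age, rng, max_shift=2):
    """Perturbation: shift the value by a small random amount."""
    return age + rng.randint(-max_shift, max_shift)


def aggregate_by_age_range(records):
    """Aggregation: keep only the head count per age range."""
    return Counter(generalize_age(r["age"]) for r in records)


def anonymize(record, secret_key):
    """Combine suppression, pseudonymization, and generalization."""
    out = suppress(record, {"name", "age", "phone"})
    out["pseudonym"] = pseudonymize(record["name"], secret_key)
    out["age_range"] = generalize_age(record["age"])
    out["area_code"] = area_code(record["phone"])
    return out
```

Calling anonymize(RECORDS[0], b"demo-key") yields a record that carries a pseudonym, the age range 30-39, and the area code 212, but no name, exact age, or full phone number. Which of these transformations is appropriate still depends on how the data will be used.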

There is no one-size-fits-all technique. The optimal choice depends on a range of factors, including the peculiarities of local legislation, privacy guidelines at the industry level, as well as the type and intended use of the data to be anonymized. Some organizations take a shortcut by outsourcing this task to specially crafted anonymization tools such as IBM Security Guardium or Oracle Advanced Security.

A fairly simple yet effective way to anonymize visitors’ IP addresses is to leverage the IP anonymization feature provided by Google Analytics. Website owners can opt for this instrument to align their projects with privacy laws such as GDPR.
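
The general idea behind IP anonymization can be sketched in a few lines of Python. This is not Google’s code; it is a hypothetical helper, anonymize_ip, that zeroes the last octet of an IPv4 address and the last 80 bits of an IPv6 address, which is the truncation Google has described for this feature. The sample addresses come from ranges reserved for documentation.

```python
import ipaddress


def anonymize_ip(address):
    """Zero the host part of an IP address so it no longer pinpoints a device.

    Assumption: keep the first 24 bits of an IPv4 address (drop the last
    octet) and the first 48 bits of an IPv6 address (drop the last 80 bits).
    """
    ip = ipaddress.ip_address(address)
    prefix = 24 if ip.version == 4 else 48
    # strict=False lets ip_network() accept an address with host bits set.
    network = ipaddress.ip_network(f"{address}/{prefix}", strict=False)
    return str(network.network_address)


print(anonymize_ip("203.0.113.57"))  # 203.0.113.0
print(anonymize_ip("2001:db8:85a3:8d3:1319:8a2e:370:7348"))  # 2001:db8:85a3::
```

The truncated address still reveals the approximate network a visitor came from, so this is closer to generalization than to full suppression.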

De-Anonymization at a Glance

Even after data has been anonymized, there could be workarounds to unveil an individual’s identity. Some of the mechanisms listed above can be reversible under certain circumstances. For instance, hashed data records can be recovered by repeatedly guessing candidate values and hashing each one until a matching hash is identified. This works particularly well when the range of possible inputs is small, as is the case with phone numbers or dates of birth.
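
The guessing approach is easy to demonstrate. The sketch below assumes, purely for illustration, that a phone number was hashed with unsalted SHA-256 and that the attacker knows its first six digits. The helper names sha256_hex and recover_phone, as well as the sample numbers, are hypothetical.

```python
import hashlib


def sha256_hex(value):
    """Return the unsalted SHA-256 hash of a string as hex."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def recover_phone(target_hash, prefix="212-555-"):
    """Try all 10,000 possible endings until one hashes to the target."""
    for n in range(10_000):
        candidate = f"{prefix}{n:04d}"
        if sha256_hex(candidate) == target_hash:
            return candidate
    return None  # the original value is outside the guessed range


leaked_hash = sha256_hex("212-555-0143")  # what the "anonymized" set stores
print(recover_phone(leaked_hash))  # 212-555-0143
```

With only 10,000 candidates the search finishes almost instantly. Salting or keying the hash raises the bar, but it does not help much if the salt or key leaks along with the data.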

Even in the case of suppression, which is deemed the most effective anonymization tactic, PII might still be revealed by cross-referencing the remaining information with other sets of data. This process is known as de-anonymization. On a side note, it’s always a matter of trial and error, and there is no guarantee that it will yield the expected results in any given situation.
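
Cross-referencing of this kind can be sketched as a simple join on the attributes that two data sets share. Everything in the example below is invented for illustration: the records, the field names, the choice of ZIP code, birth year, and sex as linking attributes, and the helper function link.

```python
from collections import defaultdict

# Attributes assumed to appear in both data sets.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

# Hypothetical "anonymized" records: names were suppressed.
ANONYMIZED = [
    {"zip": "10001", "birth_year": 1985, "sex": "F", "diagnosis": "asthma"},
    {"zip": "10002", "birth_year": 1990, "sex": "M", "diagnosis": "diabetes"},
]

# Hypothetical public data set that still contains names.
PUBLIC = [
    {"name": "Alice Example", "zip": "10001", "birth_year": 1985, "sex": "F"},
    {"name": "Bob Example", "zip": "10002", "birth_year": 1990, "sex": "M"},
    {"name": "Dan Example", "zip": "10002", "birth_year": 1990, "sex": "M"},
]


def link(anonymized, public):
    """Return (name, diagnosis) pairs for rows that match exactly one person."""
    index = defaultdict(list)
    for person in public:
        key = tuple(person[q] for q in QUASI_IDENTIFIERS)
        index[key].append(person["name"])

    matches = []
    for row in anonymized:
        key = tuple(row[q] for q in QUASI_IDENTIFIERS)
        names = index.get(key, [])
        if len(names) == 1:  # a unique match re-identifies the row
            matches.append((names[0], row["diagnosis"]))
    return matches


print(link(ANONYMIZED, PUBLIC))  # [('Alice Example', 'asthma')]
```

The second record stays unresolved because two people share the same combination of attributes, which illustrates why the outcome of such an attempt is never guaranteed.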

In theory, data encryption isn’t a rock-solid barrier to de-anonymization either. If it’s crudely implemented or if the cipher isn’t strong enough to thwart brute-forcing, well-motivated individuals may be able to crack it and retrieve the information. However, with many present-day cryptographic algorithms being extremely reliable, encrypted data is generally safe.
