How to Discover Personal Data Across Your Systems

*Read more about author Vladimir Stepanov.*

Monitoring personal data across your systems has become a necessary evil, with privacy compliance creeping up on the agenda every year. While you can find numerous solutions that provide visibility into the data you store, there are three main methods of tracking personal data: surveys, scanning, and HTTP proxy implementation.

But which of these techniques is best for your organization?

Surveys

Contrary to some of the marketing materials for the scanning software out there, surveys can be a cheap and effective way of discovering what personal data you hold and in which systems. This method suits smaller companies that don’t use many different SaaS tools and databases.

Carrying out surveys to discover personal data across your systems is as simple as it sounds. You just need to send an email to all relevant stakeholders asking which:

SaaS tools they use
Databases they use
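
Once the replies come in, they can be merged into a single inventory. The following is a minimal sketch, assuming each reply has been typed up as a dictionary; the field names (`stakeholder`, `saas`, `databases`), the `build_inventory` helper, and the sample tool names are all hypothetical.

```python
from collections import defaultdict


def build_inventory(responses):
    """Map each reported system to the set of stakeholders who use it."""
    inventory = defaultdict(set)
    for reply in responses:
        for kind in ("saas", "databases"):
            for name in reply.get(kind, []):
                # Normalize names so "CRM Tool " and "crm tool" count as one system.
                key = (kind, name.strip().lower())
                inventory[key].add(reply["stakeholder"])
    return dict(inventory)


# Hypothetical survey replies.
responses = [
    {"stakeholder": "sales", "saas": ["CRM Tool "], "databases": []},
    {"stakeholder": "support", "saas": ["crm tool", "Helpdesk"], "databases": ["tickets_db"]},
]

for (kind, name), owners in sorted(build_inventory(responses).items()):
    print(kind, name, sorted(owners))
```

Normalizing the names matters more than it looks: free-text survey answers rarely spell the same tool the same way twice.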

In a small company, chances are, SaaS use is more pervasive from a data standpoint than databases, and this can make it hard to keep tight control over personal data.

Moreover, taking stock of what data is there means the data owners need to have good knowledge of their data practices. However, larger companies will need to have perfect knowledge of their data inventory in order to be compliant with privacy regulations.

This can be a big downside to surveys if perfect information is needed: Data owners need to sift through everything, which takes time and leaves room for mistakes.

Still, surveys offer a primitive way to get a bird’s-eye view of the data in the company and are a quick and cheap solution for data discovery.

Scan

While conducting surveys can be a chore for those tasked with doing so, there are also software tools that can take care of the manual work for you.

These tools operate on a kind of set-it-and-forget-it basis. They work similarly to an anti-virus scanner, whereby you launch the scan, go about your business, and then come back to it when the scan is done.

Just like anti-virus scanners, the time it takes to complete a scan of all systems depends on the amount of data you store. Moreover, if you are using cloud-based tools (SaaS), the scanners will need to have integrations to sift through what you are storing off-site.
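
To make the idea of scanning data at rest concrete, here is a minimal sketch, assuming the data sits in text files on a local disk. The two regular expressions and the `scan_directory` helper are illustrative assumptions; commercial scanners use far richer detection and also cover databases and SaaS integrations.

```python
import re
from pathlib import Path

# Hypothetical, deliberately simple patterns for illustration only.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_directory(root):
    """Walk every file under root and count pattern matches per file."""
    root = Path(root)
    findings = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Ignore undecodable bytes so binary files do not abort the scan.
        text = path.read_text(encoding="utf-8", errors="ignore")
        counts = {name: len(rx.findall(text)) for name, rx in PATTERNS.items()}
        counts = {name: n for name, n in counts.items() if n}
        if counts:
            findings[str(path.relative_to(root))] = counts
    return findings
```

Calling `scan_directory("/path/to/export")` returns a dictionary keyed by relative file path, listing only the files where something matched. Every file has to be read in full, which is why scan time grows with the amount of data stored.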

This can be a problem if you have a lot of data to go through, or are using SaaS that needs custom integrations to be scanned. Conversely, if you want detailed insight into everything you hold and where, have the time to do a full scan, and are using popular software tools, this is probably the solution for you.

In 2016, IDG reported that the average company stored 162.9TB of data. This would take a scanner around 70 days to process if 80% were unstructured data. Imagine how long that would take now.
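
For a rough sense of scale, the two figures above imply an average throughput, which can then be reused to estimate other volumes. This is a back-of-the-envelope sketch, assuming decimal terabytes and a constant scan rate; it ignores the structured/unstructured split, and the function names are hypothetical.

```python
TB = 10**12  # decimal terabyte (assumption; the source does not specify the unit convention)
MB = 10**6
SECONDS_PER_DAY = 24 * 60 * 60


def implied_throughput_mb_per_s(total_tb, days):
    """Average scan rate implied by a data volume and a scan duration."""
    return total_tb * TB / (days * SECONDS_PER_DAY) / MB


def days_to_scan(total_tb, mb_per_s):
    """Scan duration in days for a given volume at a constant rate."""
    return total_tb * TB / (mb_per_s * MB) / SECONDS_PER_DAY


rate = implied_throughput_mb_per_s(162.9, 70)
print(f"{rate:.1f} MB/s")  # prints "26.9 MB/s"
```

At that constant rate, scan time scales linearly with volume: `days_to_scan(2 * 162.9, rate)` comes out at about 140 days.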

While machine learning is helping scanners get faster, if you have auditors at your door and need an inventory fast, you would have to run multiple scanners in parallel at a large cost. More advanced scanners also generally cost more.

It’s also worth mentioning the invasive nature of these scanning tools. In order for them to audit all of your data to discover what personal data you hold, you need to authorize access to many (if not all) of the places you store data.

HTTP Proxy

The last method to discover personal data is by monitoring it with an HTTP proxy. This method involves using a standard proxy that receives traffic and forwards it to a separate service that analyzes it.

In a nutshell, it works by rerouting traffic through the proxy, which forwards it to an analyzer. The analyzer extracts the personal information and turns it into a metadata record, which is then sent to a dashboard.
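
The analyzer step can be sketched as follows. This is an illustration under stated assumptions, not a description of any particular product: it assumes JSON payloads, swaps in a single email regular expression for the real classifier, and the record layout (`source`, `destination`, `observed_at`, `personal_data`) is hypothetical.

```python
import json
import re
from datetime import datetime, timezone

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def walk(value, path=""):
    """Yield (field_path, text) for every string inside a parsed JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from walk(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from walk(child, f"{path}[{index}]")
    elif isinstance(value, str):
        yield path, value


def analyze(source, destination, body):
    """Turn one intercepted JSON payload into a metadata record.

    Only field names and detected data types are kept, never the values.
    """
    detected = [
        {"field": field, "type": "email"}
        for field, text in walk(json.loads(body))
        if EMAIL.search(text)
    ]
    return {
        "source": source,
        "destination": destination,
        "observed_at": datetime.now(timezone.utc).isoformat(),
        "personal_data": detected,
    }
```

Because the record carries only field paths and types, the dashboard can show which flows contain personal data without itself becoming another store of that data.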

Think of it as a kind of traffic police looking for personal data that flows via API calls.

Unlike scanning, which runs through your systems processing data at rest, proxy data monitoring processes data in motion, meaning that it isn’t invasive to systems the way scanning is. However, since it handles only data in motion, discovery can happen only if personal data is in transit and passes through the proxy. Also, the proxy can add latency to data flows, as it is effectively an extra hurdle that the data has to jump before reaching its destination.

Proxy systems scan data as it is being used, recognizing (using machine learning) and classifying the parts that are personal data. This can be helpful when trying to uncover risky data practices because you can see where the data has come from and where it is going.

It is worth noting, though: As with the methods above, the proxy’s analysis runs asynchronously to the data flows, and the proxy itself is built to be as fast as possible so as not to harm latency and throughput. Failsafes can be added to “switch off” the proxy if it is causing data traffic jams.
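
One way to picture asynchronous analysis with a failsafe is a proxy that forwards every request straight away and only hands a copy to the analyzer when there is room in a bounded queue. The sketch below is one possible design, not a documented implementation; `MirroringProxy`, `forward`, `max_pending`, and `drain` are hypothetical names.

```python
import queue


class MirroringProxy:
    """Forward requests immediately; analyze copies off the request path."""

    def __init__(self, forward, max_pending=1000):
        self.forward = forward  # callable that sends the request upstream
        self.pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0  # copies skipped because the analyzer fell behind

    def handle(self, request):
        try:
            # Never block the data flow: enqueue only if there is room.
            self.pending.put_nowait(request)
        except queue.Full:
            # Failsafe: analysis is effectively switched off for this request.
            self.dropped += 1
        return self.forward(request)

    def drain(self, analyze):
        """Run the analyzer over everything queued so far."""
        while True:
            try:
                item = self.pending.get_nowait()
            except queue.Empty:
                return
            analyze(item)
```

In a real deployment `drain` would run continuously in a background worker. The trade-off of this design is that traffic is never held up, at the price of some requests going unanalyzed during spikes, which the `dropped` counter makes visible.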

If the service observes enough traffic for long enough, it can tell how many systems there are on the backend. This means there is no need to scan: monitoring traffic is enough to know which systems are there.
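
As a small illustration of inferring backend systems from traffic alone, the sketch below counts the distinct destination hosts the proxy has seen. Treating each hostname as one system is a simplifying assumption, and the `discover_systems` helper and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def discover_systems(observed_urls):
    """Count requests per destination host seen by the proxy."""
    hosts = Counter()
    for url in observed_urls:
        host = urlsplit(url).hostname
        if host:
            hosts[host] += 1
    return hosts


# Hypothetical destinations taken from proxy logs.
observed = [
    "https://crm.internal/api/contacts",
    "https://billing.internal/invoices/42",
    "https://crm.internal/api/contacts/7",
]

systems = discover_systems(observed)
print(len(systems), "systems:", sorted(systems))
```

The longer the observation window, the more complete this list becomes, since rarely used systems only show up once traffic actually reaches them.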

Which Is Best?

In all honesty, it depends on what your needs are, the size of your company, and the amount of data you store. I’ve broken down the pros and cons of each method in the table below.

Looking for a low-budget solution in an environment with relatively little data and few systems? Choose survey or scan.

Looking for an in-depth view into everything you have ever collected? Choose scan.

Looking to monitor data as it flows around your systems? Go for HTTP proxy.
