Six Questions About Synthetic Data

By Dr. Sigal Shaked

What is synthetic data, and why are more and more companies turning to it as an alternative to the real thing? Here’s what you should know about the benefits and uses of synthetic data.

What is the buzz about synthetic data?

Organizations often lack enough data to train and run the AI/ML models they use to generate customer insights, optimize operations, detect fraud, and more. Collecting data is often expensive and time-consuming, and privacy regulations can prohibit organizations from using the data they do collect.

Where is synthetic data used?

AI/ML models are becoming the norm across many industries, but they are most pervasive in the financial community. By ingesting raw information from large data sets, identifying patterns and correlations, and drawing inferences, these models improve trading performance, streamline processes, reduce risk, and improve customer service. They are in high demand for specialized applications such as KYC (know your customer), NBO (next best offer), and risk management.

Which problems does synthetic data solve?

AI/ML models are starved for data. Linear algorithms need hundreds of examples per class, while more complex algorithms need tens of thousands to millions of examples. The available data can also be biased. A machine learning model generalizes from whatever data it is trained on, so if that data tells a skewed or incomplete story, the rules the model learns will be fundamentally unsound. Even data that is safe to use and representative of every segment of the population can still be unusable because it’s incomplete, irrelevant, or out of date. Many enterprises also have data inconsistencies because their data resides in silos across different regions and business units.
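As a small illustration of the "examples per class" point, the following Python sketch counts labeled examples and flags any class that falls below a minimum. The function name `underrepresented_classes`, the threshold of 100, and the sample labels are all assumptions made up for this example; the right threshold depends on the algorithm and the problem.

```python
from collections import Counter


def underrepresented_classes(labels, min_per_class=100):
    """Return {label: count} for every class with fewer than min_per_class examples.

    min_per_class=100 is an assumed cutoff for illustration only.
    """
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < min_per_class}


# Hypothetical fraud-detection labels: plenty of legitimate cases, very few fraud cases.
labels = ["legitimate"] * 5000 + ["fraud"] * 12

shortfall = underrepresented_classes(labels)
print(shortfall)
```

A class that shows up in the result is a candidate for augmentation with synthetic examples.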

How is synthetic data preferable to masking or other options?

Anonymization techniques such as data generalization, pseudonymization, data masking, and perturbation blur the data, which makes it less accurate for analysis because it loses important characteristics. In addition, attackers can often re-identify the original records by combining the anonymized data with external information.

Synthetic data, on the other hand, is safer because a synthetic record doesn’t correspond to any original record and so can’t be reversed to one. It is constructed to follow the same characteristics as the original data without revealing customer identities.
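To make the re-identification risk of masked data concrete, here is a minimal Python sketch of a linkage attack. Everything in it is hypothetical: the records, the field names, and the choice of ZIP code, birth year, and sex as the linking fields are assumptions for illustration, not data or a method from any real system.

```python
# Hypothetical "masked" release: names removed, but other attributes kept.
masked_records = [
    {"zip": "02138", "birth_year": 1965, "sex": "F", "balance": 48200},
    {"zip": "02139", "birth_year": 1982, "sex": "M", "balance": 1300},
    {"zip": "02139", "birth_year": 1982, "sex": "M", "balance": 9100},
]

# Hypothetical external information, such as a public directory that includes names.
public_directory = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1965, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1982, "sex": "M"},
]

# Assumed linking fields shared by both sources.
LINK_FIELDS = ("zip", "birth_year", "sex")


def link_key(record):
    """Build the tuple of linking-field values for one record."""
    return tuple(record[field] for field in LINK_FIELDS)


def link(masked, external):
    """Return {name: masked_record} for each person who matches exactly one masked record."""
    by_key = {}
    for record in masked:
        by_key.setdefault(link_key(record), []).append(record)

    matches = {}
    for person in external:
        candidates = by_key.get(link_key(person), [])
        if len(candidates) == 1:  # a unique match re-identifies the person
            matches[person["name"]] = candidates[0]
    return matches


reidentified = link(masked_records, public_directory)
print(reidentified)
```

In this toy data, one person is matched to a single masked record and so to a balance, while the other is not, because two masked records share the same linking values.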

How is synthetic data generated?

Algorithms distinguish important features within the original data and synthesize new data that preserves the main behavioral features identified.

Data synthesis includes three components: metadata discovery, which extracts the data elements while preserving data integrity across multi-table data sources; a generative model trainer, which generates data that retains the behavioral features of the original production data without violating personal privacy regulations; and a synthesizer, which validates the generated data against predefined quality measures.
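As a rough illustration of the components above, the following Python sketch runs a toy pipeline on a single numeric table. It is not the method described in this article: a multivariate Gaussian stands in for the generative model, the function names (`discover_metadata`, `train_model`, `synthesize`, `validate`) are hypothetical, the "original" data is itself simulated, and the quality measure (a maximum correlation gap of 0.1) is an assumed threshold. A model this simple also carries no formal privacy guarantee.

```python
import numpy as np


def discover_metadata(columns, data):
    """Record a basic schema: column names with their observed min and max."""
    return {
        name: {"min": float(data[:, i].min()), "max": float(data[:, i].max())}
        for i, name in enumerate(columns)
    }


def train_model(data):
    """Fit the stand-in generative model: column means and the covariance matrix."""
    return {"mean": data.mean(axis=0), "cov": np.cov(data, rowvar=False)}


def synthesize(model, n_rows, rng):
    """Sample new rows from the fitted Gaussian; no original row is copied."""
    return rng.multivariate_normal(model["mean"], model["cov"], size=n_rows)


def validate(original, synthetic, tolerance=0.1):
    """Assumed quality measure: correlations must agree within the tolerance."""
    gap = np.abs(
        np.corrcoef(original, rowvar=False) - np.corrcoef(synthetic, rowvar=False)
    ).max()
    return bool(gap <= tolerance)


rng = np.random.default_rng(0)

# Simulated stand-in for production data: income and correlated spending.
income = rng.normal(50_000, 10_000, size=2_000)
spending = 0.3 * income + rng.normal(0, 2_000, size=2_000)
original = np.column_stack([income, spending])

metadata = discover_metadata(["income", "spending"], original)
model = train_model(original)
synthetic = synthesize(model, n_rows=2_000, rng=rng)
is_valid = validate(original, synthetic)
print(metadata.keys(), synthetic.shape, is_valid)
```

Real tabular data usually has categorical columns, skewed distributions, and relationships across tables, so a production system would need a richer model and stronger quality and privacy checks than this sketch.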

What is the future of synthetic data?

I believe that obtaining the data you need will be simple in the future. Data synthesis will be as easy as copy-paste. Generating data for innovation will be embedded in the development process, taking place behind the scenes. Data will no longer be an obstacle for AI/ML models but a plentiful resource, enabling them to generate powerful insights.
