Elon Musk recently stated that we have exhausted all human data available in today’s market for training AI models. Musk isn’t the first to notice that we need more data sources if artificial intelligence is to continue its rapid progress. This is felt particularly in industries like healthcare and finance, where tight privacy regulations are making the shortage of data even more acute.
The answer many are pointing to is synthetic data. The idea isn't new, but interest in it is growing, as evidenced by the recent wave of mergers and investments in the field. With this renewed interest come new uncertainties around the use of such data, most notably the risk of model collapse: the progressive deterioration of a large language model's (LLM's) output quality when it is trained without real-world data.
What Is Synthetic Data?
Synthetic data is artificially created rather than collected from real-world sources. The most common form is AI-generated: models are trained on real-world data to learn its common patterns and statistical properties, and are then used to generate new data that mimics those properties.
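As a rough illustration of that "learn the properties, then sample" idea, the sketch below fits nothing more than a mean vector and covariance matrix to a made-up numeric table and samples new rows from them. It is a deliberately minimal toy (real synthetic-data tools handle mixed column types, skewed distributions and privacy constraints far more carefully), and every value in it is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "real" table: two numeric columns (age-like and income-like) -- invented data.
real = np.column_stack([
    rng.normal(40, 10, 1_000),        # age-like column
    rng.lognormal(10, 0.5, 1_000),    # income-like column (skewed)
])

# 1. Learn the statistical properties of the real data
#    (here: just the mean vector and the covariance matrix).
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# 2. Generate synthetic rows that mimic those properties.
#    A plain Gaussian ignores the income skew -- one reason real tools are more sophisticated.
synthetic = rng.multivariate_normal(mean, cov, size=1_000)

# The synthetic table has similar summary statistics but contains no real individuals.
print("real means:     ", real.mean(axis=0).round(1))
print("synthetic means:", synthetic.mean(axis=0).round(1))
```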
LLMs are being used to generate a breadth of synthetic data types, from structured data (such as tabular records) to unstructured data (such as free text, video and images). The generation method varies depending on the type of data being produced.
Growing Momentum for Synthetic Data
In the past five years, the rapid development of LLMs has boosted both the demand for synthetic data and the means of generating it at scale. As a result, synthetic data usage has skyrocketed.
Microsoft’s Phi-4 model, which outperforms much larger LLMs on several benchmarks despite its small size, was trained largely on synthetic data. Meanwhile, engineers working on Amazon’s Alexa are exploring a teacher/student approach, in which a large “teacher” model generates synthetic data that is then used to fine-tune a smaller “student” model.
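Amazon has not published the details of that pipeline, so the following is only a generic sketch of the teacher/student pattern, using small scikit-learn models and simulated data as stand-ins: a large "teacher" trained on real data labels a pool of synthetic inputs, and a much smaller "student" is trained only on that teacher-labelled synthetic set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Simulated "real" data standing in for real user interactions.
X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. The large teacher model is trained on the real data.
teacher = RandomForestClassifier(n_estimators=300, random_state=0)
teacher.fit(X_train, y_train)

# 2. The teacher labels a large pool of synthetic inputs, producing a
#    synthetic training set (here the inputs are just noisy feature samples).
rng = np.random.default_rng(0)
X_synth = rng.normal(X_train.mean(axis=0), X_train.std(axis=0),
                     size=(10_000, X_train.shape[1]))
y_synth = teacher.predict(X_synth)

# 3. The small student model is trained only on teacher-labelled synthetic data.
student = DecisionTreeClassifier(max_depth=6, random_state=0)
student.fit(X_synth, y_synth)

# The student is far smaller, yet inherits much of the teacher's behaviour.
print("teacher accuracy on held-out real data:", round(teacher.score(X_test, y_test), 3))
print("student accuracy on held-out real data:", round(student.score(X_test, y_test), 3))
```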
The increase in use is also reflected in the money flowing into the market. The synthetic data sector saw an investment boom in 2021-22, and more recently the trend has shifted towards large-scale acquisitions; NVIDIA’s acquisition of Gretel this spring is a notable example.
The analytics firm Cognilytica estimated the synthetic data generation market to have been worth around $110 million in 2021 and expects it to reach $1.15 billion by 2027. This is a market that is speeding up, not slowing down.
Model Collapse
However, synthetic data’s exciting potential comes with a crucial downside: model collapse. While original data tends to be highly complex, synthetic data is often simplified and condensed by the models that produce it. A recent study by academics from Oxford, Cambridge, Imperial College and the University of Toronto found that indiscriminately training models on model-generated data causes irreversible defects: rare, “tail” patterns in the original data progressively disappear from what the models can produce.
On top of this, most LLMs are “black boxes,” making it difficult to understand how they will respond to this new dataset. Researchers from Rice University and Stanford concluded that without some fresh real-world data, “future generative models are doomed to have their quality (precision) or diversity (recall) progressively decrease.”
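To make the failure mode concrete, here is a tiny, self-contained simulation (not taken from the studies above; the one-dimensional "real" dataset is invented). A toy generative model that can only fit a single Gaussian is trained repeatedly on its own output: the two-peaked structure of the original data is flattened in the first generation and never recovered, echoing the way rare and complex patterns vanish when models feed on their own simplified output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented "real" data: a complex, two-peaked distribution (clusters near -3 and +3).
real = np.concatenate([rng.normal(-3, 0.5, 5_000), rng.normal(3, 0.5, 5_000)])

def fit_and_sample(data, n, rng):
    """Toy generative model: it can only fit a single Gaussian, so it
    inevitably simplifies whatever it is trained on."""
    return rng.normal(data.mean(), data.std(), size=n)

def share_near_peaks(data):
    # Fraction of samples within 1 unit of either original peak.
    return np.mean(np.abs(np.abs(data) - 3) < 1)

print(f"real data:  share near peaks = {share_near_peaks(real):.2f}")

data = real
for gen in range(1, 6):
    data = fit_and_sample(data, n=1_000, rng=rng)  # train only on model output
    print(f"generation {gen}: share near peaks = {share_near_peaks(data):.2f}")
```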
The Need for Real-World Inputs
Even as demand for synthetic data rises, the need for real-world data remains; demand for high-quality real-world data may even increase. The reason is twofold.
First, real-world data will always be needed to train the AI models that generate synthetic data in the first place. Second, avoiding model collapse requires continually recalibrating synthetic data against original, real-world data.
The Role of Real Data in Training Synthetic Data-Producing AI Models
As mentioned above, most synthetic data today is produced by generative AI models, and those models must be trained on real-world data to produce coherent output. That is because synthetic data must replicate the patterns and statistical properties of a real-world dataset, and those properties can only be learned from real examples.
Mitigating Model Collapse with Real Data
There are several strategies for mitigating the risk of model collapse. These include validating synthetic datasets before use and reviewing them regularly, alongside checking the quality of synthetic data before it enters a generative AI training pipeline. The most common approach, however, is to diversify the training mix by combining synthetic data with human-generated data. Gartner’s survey found that 63% of respondents favor using a partially synthetic dataset, with only 13% saying they use fully artificial data.

Data quality also counts, as illustrated by Microsoft’s success with Phi-4. The model was trained largely on synthetic data generated by GPT-4o, but much of its pre-training data – the general dataset used for the first stage of training before a model is fine-tuned – was carefully curated, high-quality real-world data, including books and research papers.
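A minimal sketch of the mixing-plus-validation approach described above is shown below: a crude quality gate compares the synthetic data's summary statistics against the real reference, and the training set is then assembled as a partially synthetic mix anchored by real rows. The function names, the tolerance and the mixing share are all illustrative choices, not recommendations drawn from the surveys or papers cited above.

```python
import numpy as np

def passes_quality_check(real, synthetic, tol=0.15):
    """Crude gate: synthetic column means and spreads must stay within
    a relative tolerance of the real reference before the data is used."""
    ok_mean = np.allclose(synthetic.mean(axis=0), real.mean(axis=0), rtol=tol)
    ok_std = np.allclose(synthetic.std(axis=0), real.std(axis=0), rtol=tol)
    return ok_mean and ok_std

def build_training_set(real, synthetic, synthetic_share=0.5, rng=None):
    """Assemble a partially synthetic dataset anchored by real rows."""
    if rng is None:
        rng = np.random.default_rng()
    if not passes_quality_check(real, synthetic):
        raise ValueError("synthetic data drifted too far from the real reference")
    n_synth = int(len(real) * synthetic_share / (1 - synthetic_share))
    picked = rng.choice(len(synthetic), size=min(n_synth, len(synthetic)), replace=False)
    return np.concatenate([real, synthetic[picked]])

# Tiny usage example with made-up two-column numeric tables.
rng = np.random.default_rng(0)
real = rng.normal([40.0, 3.0], [10.0, 0.8], size=(1_000, 2))
synthetic = rng.normal([41.0, 3.1], [11.0, 0.85], size=(5_000, 2))
mixed = build_training_set(real, synthetic, synthetic_share=0.5, rng=rng)
print(mixed.shape)  # (2000, 2): half real, half validated synthetic
```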
The Positive Impacts Synthetic Data Can Have
When synthetic data is used responsibly and intelligently, and always in combination with real-world data, it can help address six key issues in training AI models: scarcity, accessibility, homogeneity, bias, privacy and cost.
Real Data Scarcity
As AI companies fight for market share and chase new firsts, the demand for data to train their LLMs only increases. Synthetic data can cover part of this demand, but significant amounts of real data will still be needed, both in pre-training datasets and for the ongoing recalibration that keeps model collapse at bay.
Accessibility
Synthetic data has the potential to democratize generative AI by making large volumes of training data affordable and accessible, so that progress does not become a race between a handful of big tech companies. It does not, however, remove the responsibility of big tech to improve access to real-world data, which is still needed to train the models that generate synthetic data.
Homogeneity
In certain cases, such as training AI for driverless cars (autonomous vehicles), real-world datasets are too homogeneous: they under-represent the unusual or dangerous situations a model must still be able to handle. Here, developers can generate synthetic data to fill those gaps, enabling models to train on rare occurrences on the road.
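A small sketch of the idea, using an invented driving log in pandas: routine scenarios dominate, so extra synthetic rows are generated for the rare ones until each scenario reaches a target count. In practice the rare scenes would come from a simulator or a generative model rather than the simple jitter used here; every name and number below is made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Invented driving-scenario log: most rows are routine, very few are edge cases.
log = pd.DataFrame({
    "scenario": ["clear_highway"] * 9_000 + ["deer_crossing"] * 30 + ["sudden_fog"] * 20,
    "speed_kmh": rng.normal(95, 15, 9_050),
})

target_per_scenario = 2_000
frames = [log]
for scenario, group in log.groupby("scenario"):
    missing = target_per_scenario - len(group)
    if missing <= 0:
        continue
    # Stand-in for a simulator or generative model: jitter existing rare
    # examples to sketch the idea of filling the gap synthetically.
    synth = group.sample(missing, replace=True, random_state=0).copy()
    synth["speed_kmh"] += rng.normal(0, 5, missing)
    frames.append(synth)

balanced = pd.concat(frames, ignore_index=True)
print(balanced["scenario"].value_counts())
```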
Bias
Another key area to consider is issues concerning bias – something systemically evident in real-world datasets. Here, synthetic data generation tools can be used to ensure AI models receive a more balanced picture.
A 2021 study found that using synthetic face images to augment biased facial recognition datasets allowed researchers to reduce the required amount of real-world training data by up to 75% for face recognition and by 50% for facial landmark detection, while maintaining performance. This demonstrates that synthetic data not only helps correct demographic imbalances affecting underrepresented groups but also makes model training more efficient.
Privacy
For high-security sectors such as healthcare and finance, data privacy requirements further exacerbate data shortages. With synthetic data, companies can build training datasets around niche cases without putting customers’ privacy at risk. However, as a report commissioned by the UK’s Royal Society has pointed out, the assumption that synthetic data is “inherently private” is a misconception: synthetic data can still leak information about the real data it was derived from.
Cost
Generally speaking, synthetic data is cheaper to generate than real-world data is to collect. It also arrives already labeled, which saves further time and cost: on some AI training projects, up to 80% of the effort goes into data preparation such as labeling. This explains why dedicated companies have emerged specifically to source low-cost labor to meet the data-processing needs of Silicon Valley giants.
Data Augmentation
The varied benefits of synthetic data can be realized, provided it is not treated as a replacement for real data. Instead, its role should be to augment real datasets, increasing the scale of the data available for training.
For context, Meta’s upcoming LLM, LLAMA Behemoth, is being trained on 30 trillion data points. Clearly, finding real-world data at this scale is challenging, if not impossible.
Still, real-world data remains essential, both for training the models that generate synthetic data and for calibrating synthetic datasets to maintain accuracy and prevent model collapse. Even as synthetic data takes on a larger role, the sheer scale of today’s LLMs ensures strong ongoing demand for authentic data.

