AI's Most Pressing Constraint Has Nothing to Do with Chips or Compute
For years, the AI industry operated on a deceptively simple formula: more data equals better models. Companies scraped the internet, digitised libraries, and licensed enormous datasets to feed ever-larger neural networks. The results were extraordinary. But a strange and underreported problem has emerged at the heart of this progress. The world is running out of usable training data.
By The Numbers
- Estimated total text on the public internet: roughly 250 billion pages, yet only a fraction meets quality thresholds for AI training
- Annual growth of new web content: approximately 5 to 7 per cent, far below the doubling rate of model parameter counts
- Projected data exhaustion timeline: high-quality English text may be effectively depleted for training purposes by 2028
- Synthetic data adoption: over 60 per cent of leading AI labs now use some form of machine-generated training data
- Asia-Pacific language data gap: training corpora for languages such as Thai, Vietnamese and Bahasa Indonesia remain 10 to 50 times smaller than their English equivalents
Why the AI Training Data Well Is Drying Up
The core issue is easy to state. Large language models learn by ingesting vast quantities of text, images, and code, and each generation of model demands significantly more training data than the last. GPT-3 trained on roughly 300 billion tokens; its successors required trillions. The exponential appetite of these systems has comprehensively outpaced the linear growth of the internet.
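The arithmetic behind that divergence is easy to sketch. The snippet below is a back-of-envelope illustration only: it assumes demand roughly doubles each year, that the web's usable stock grows about 6 per cent annually, and that a page yields around 1,000 usable tokens. None of these are measured constants, but any numbers in this vicinity produce the same shape: an exponential curve crossing a near-flat line.

```python
# Back-of-envelope comparison: training-data demand vs. web-text supply.
# All numbers are illustrative assumptions, not measured constants.

TOKENS_GPT3 = 300e9     # ~300 billion tokens (reported for GPT-3)
WEB_STOCK = 250e9 * 1000  # ~250bn pages x assumed ~1,000 usable tokens/page
DEMAND_GROWTH = 2.0     # assume demand roughly doubles per generation
SUPPLY_GROWTH = 1.06    # assume ~6% annual growth in new web text

demand, supply = TOKENS_GPT3, WEB_STOCK
for year in range(2020, 2033):
    if demand > supply:
        print(f"{year}: demand ({demand:.2e} tokens) exceeds "
              f"usable stock ({supply:.2e} tokens)")
        break
    demand *= DEMAND_GROWTH   # one model generation per year, for simplicity
    supply *= SUPPLY_GROWTH
```

Under these toy assumptions the crossover lands in the early 2030s; tweak the constants and it moves a few years either way, which is roughly the spread seen in published forecasts.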
Researchers at Epoch AI have published findings suggesting the stock of high-quality text data could be fully consumed within a few years. Low-quality data remains abundant, but feeding it into models introduces noise, bias, and degraded performance. The distinction between quantity and quality has become the central tension in AI development today.
"The data bottleneck is not a theoretical concern. It is the most immediate constraint on scaling the next generation of foundation models." - Epoch AI Research
This matters far beyond the laboratories building these systems. If the AI training data bottleneck cannot be solved, the pace of improvement in every AI-powered product, from translation tools to medical diagnostics to financial modelling, will slow. The implications are global. But as this article will show, they are especially acute across Asia-Pacific.
The Synthetic Data Gamble
Faced with scarcity, labs have turned to a controversial solution: synthetic data, training material generated by AI models themselves. The logic is appealing. If you cannot find enough real-world data, create artificial substitutes that mimic its statistical properties.
Companies including Nvidia, Google DeepMind, and several Chinese labs have invested heavily in synthetic data pipelines. Early results are mixed. Synthetic data works well for narrow tasks such as code generation and mathematical reasoning. For open-ended language understanding, however, models trained primarily on synthetic data can develop subtle distortions, a phenomenon researchers call model collapse.
Model collapse occurs when AI-generated content feeds back into training loops, gradually amplifying errors and reducing diversity of expression. It is the machine learning equivalent of photocopying a photocopy: each generation loses fidelity. The risk is not hypothetical. Several published studies have demonstrated measurable degradation in models trained through multiple generations of synthetic content.
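The dynamic is simple enough to reproduce in miniature. The sketch below is a toy illustration, not a claim about any production pipeline: a Gaussian is repeatedly refitted to samples drawn from its own previous fit, and the estimated spread, a crude stand-in for diversity of expression, drifts steadily downward.

```python
import numpy as np

# Toy illustration of model collapse: repeatedly refit a Gaussian to
# samples drawn from the previous fit. Estimation error compounds across
# generations and the fitted spread collapses toward zero.

rng = np.random.default_rng(0)
N = 100                      # samples per "training run"
mu, sigma = 0.0, 1.0         # generation 0: the real data distribution

for gen in range(301):
    if gen % 50 == 0:
        print(f"generation {gen:3d}: sigma = {sigma:.4f}")
    synthetic = rng.normal(mu, sigma, size=N)       # sample the model
    mu, sigma = synthetic.mean(), synthetic.std()   # retrain on samples

# sigma performs a downward-drifting random walk: each refit slightly
# underestimates the spread on average, and there is no fresh real data
# to pull it back. That is the photocopy-of-a-photocopy effect.
```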
The table below summarises the key trade-offs between real-world and synthetic training data approaches currently debated in the research community.
| Approach | Strengths | Weaknesses | Best Used For |
|---|---|---|---|
| Real-world data | Diverse, authentic, grounded | Finite, legally contested, expensive | General language understanding |
| Synthetic data | Scalable, controllable, cheap | Model collapse risk, low diversity | Code, maths, narrow tasks |
| Federated learning | Accesses private data without centralising it | Complex infrastructure, slower | Healthcare, finance, government |
| Active learning | Reduces data volume needed | Requires expert annotation | Specialised domains |

The Copyright Minefield Complicating AI Training Data
Data scarcity has intensified legal battles over training data. Publishers, news organisations, and creative professionals worldwide have launched lawsuits against AI companies for using copyrighted material without permission or payment. These cases are reshaping what data can legally be used and at what cost.
The consequences are significant. If courts consistently rule against AI companies, vast swaths of the internet's highest-quality content (journalism, academic writing, literary works) will be placed behind licensing walls. That would accelerate the data scarcity problem considerably.
- United States: Multiple ongoing cases from publishers including The New York Times against OpenAI and Microsoft
- European Union: The AI Act includes provisions requiring transparency about training data sources
- Japan: Initially adopted a permissive text-and-data-mining exception; now reconsidering it under pressure from domestic content creators
- South Korea: Actively debating licensing frameworks for commercial AI training
- China: Rules requiring training data legality are in place, though enforcement remains uneven
"Every major AI company is now in a licensing race, trying to secure exclusive access to high-quality data before competitors lock it up or regulators restrict it." - AI training industry observation, widely reported
This licensing race has created a new category of strategic asset. Organisations sitting on large, high-quality, legally clean datasets, whether hospitals, financial institutions, or governments, suddenly find themselves holding considerable leverage. That dynamic is playing out with particular intensity across Asia-Pacific.
The Asia-Pacific Picture on AI Training Data
Asia sits at a unique crossroads in the AI training data debate. The region produces enormous volumes of digital content daily, from social media posts and e-commerce transactions to government records and academic publications. Yet much of this data remains siloed, unstructured, or legally inaccessible for AI training purposes.
The data shortage hits differently across the region. English dominates existing training corpora, leaving models significantly weaker in languages spoken by billions. Thai, Bahasa Indonesia, Vietnamese, Tagalog, and dozens of other languages have far less digitised text available. This creates a two-tier AI landscape: users in English-speaking markets receive cutting-edge performance, while those across Southeast Asia, South Asia, and parts of East Asia receive models that struggle with local context, idiom, and cultural nuance.
Several regional initiatives are attempting to close the gap, with varying levels of resources and ambition.
- Singapore: AI Singapore has funded multilingual dataset creation programmes targeting Southeast Asian languages
- Indonesia: The government has partnered with local universities to build large-scale Bahasa Indonesia corpora
- India: Researchers are assembling datasets across Hindi, Tamil, Bengali, and other major languages under initiatives including Bhashini
- China: State-directed data-sharing initiatives have produced substantial Mandarin corpora, though within a tightly controlled ecosystem
- Japan and South Korea: Both possess rich digital archives but face cultural and legal barriers to releasing them for AI training at scale
These efforts remain modest compared to the resources available to major Western and Chinese labs. The gap is not purely financial. Southeast Asian nations are data-rich in raw terms but frequently lack the infrastructure to curate and prepare datasets at the quality levels modern models require. For more on how this imbalance shapes everyday AI usage across the region, see our deep dive on how people across Asia-Pacific actually use AI tools in 2025.
China's approach deserves particular attention. As covered in our analysis of China's five-year AI technology strategy, Beijing has made state-coordinated data access a centrepiece of its national AI competitiveness plan. Chinese labs including Baidu, Alibaba, and newer entrants have access to datasets that are simply unavailable to foreign competitors, giving domestic models a structural advantage in Mandarin and selected regional languages.
The countries and companies that resolve the data access problem first will hold a decisive advantage in the next phase of AI development. Data is becoming the new strategic resource, and Asia's fragmented approach to data governance could either accelerate or critically hinder regional AI ambitions.
What the Industry Is Doing About It
The search for solutions extends well beyond synthetic data and licensing deals. Researchers are pursuing several technically distinct approaches, each with different implications for who benefits.
Federated learning allows models to train on distributed data without centralising it, potentially unlocking private datasets held by hospitals, banks, and governments. This is particularly relevant for Asia, where data localisation laws in countries such as India, Indonesia, and Vietnam make cross-border data transfers legally fraught.
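As a concrete sketch, here is a minimal version of federated averaging (FedAvg), the canonical federated learning algorithm, applied to a toy linear model. The three "sites" and their datasets are invented for illustration; the point is that only model weights cross site boundaries, never raw records.

```python
import numpy as np

# Minimal FedAvg sketch: sites train locally on private data and share
# only weights; a coordinator averages them. Sites and data are invented.

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])

def make_site(n):
    """Private dataset for one site: y = X @ true_w + noise."""
    X = rng.normal(size=(n, 2))
    return X, X @ true_w + 0.1 * rng.normal(size=n)

sites = [make_site(n) for n in (80, 120, 200)]   # raw data stays here

def local_step(w, X, y, lr=0.1, epochs=5):
    """A few steps of gradient descent on one site's private data."""
    for _ in range(epochs):
        w = w - lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

w = np.zeros(2)                                   # global model
for round_ in range(20):
    local = [local_step(w.copy(), X, y) for X, y in sites]
    sizes = [len(y) for _, y in sites]
    w = np.average(local, axis=0, weights=sizes)  # FedAvg: weighted mean
print("federated estimate:", w, "vs true:", true_w)
```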
Active learning techniques help models identify and request only the most informative training examples, reducing total data requirements substantially. Architectural innovations are also emerging: Google DeepMind's Gemini and Anthropic's Claude have both demonstrated improved data efficiency compared to earlier model generations, extracting more value from the same volume of training material.
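The simplest form of active learning, uncertainty sampling, fits in a few lines. The sketch below uses scikit-learn on synthetic data and illustrates the technique rather than any lab's actual pipeline: at each round the model is retrained on its labelled set and requests annotations only for the pool examples it is least confident about.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Minimal active-learning loop (uncertainty sampling) on toy data:
# label only the examples the model is least sure about, instead of
# annotating the entire pool.

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
labelled = list(rng.choice(len(X_pool), size=20, replace=False))  # seed set
unlabelled = [i for i in range(len(X_pool)) if i not in labelled]

clf = LogisticRegression(max_iter=1000)
for round_ in range(10):
    clf.fit(X_pool[labelled], y_pool[labelled])
    proba = clf.predict_proba(X_pool[unlabelled])
    uncertainty = 1 - proba.max(axis=1)        # low max-prob = least sure
    query = np.argsort(uncertainty)[-20:]      # 20 most uncertain points
    for i in sorted(query, reverse=True):
        labelled.append(unlabelled.pop(i))     # "annotate" only those
    print(f"round {round_}: {len(labelled):3d} labels, "
          f"test accuracy = {clf.score(X_test, y_test):.3f}")
```

On this toy problem a few hundred targeted labels typically approach the accuracy of training on the full pool, which is the whole appeal of the approach.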
The question is whether efficiency gains can keep pace with the ambitions of the industry. For a broader view of how frontier AI labs are adapting their strategies, our coverage of why users are switching to Claude explores how data efficiency has become a genuine competitive differentiator.
There is also a harder question that fewer people are asking: what happens to AI capabilities if the data problem is not solved? The risk is not that AI stops working. It is that progress plateaus at a moment when enormous investments have been made on the assumption of continued improvement. That is a scenario with serious consequences for every business and government that has built its AI strategy around perpetual capability gains. For context on the scale of infrastructure being deployed in anticipation of that growth, see our report on floating data centres being deployed to tackle the AI energy crisis.
Frequently Asked Questions
What does the AI training data bottleneck actually mean in practice?
It means the stock of high-quality, publicly available text, images, and other media that can legally and effectively be used for training AI models is approaching its limits. Models need fresh, diverse data to continue improving, and the supply is not growing fast enough to match demand from increasingly large model architectures.
Can synthetic data solve the AI data shortage?
Partially, and under specific conditions. Synthetic data works well for narrow tasks such as coding and mathematical reasoning, but carries significant risks of model collapse and reduced linguistic diversity when used as the primary training source. Most researchers and labs treat it as a supplement to real-world data rather than a wholesale replacement.
How does the AI training data problem affect Asia-Pacific specifically?
Asian languages are disproportionately affected because far less digitised, high-quality text exists in languages like Thai, Vietnamese, and Bahasa Indonesia compared to English. Training corpora for these languages are 10 to 50 times smaller than their English equivalents, meaning AI models perform materially worse for hundreds of millions of users across the region. This gap will widen unless targeted investment in multilingual dataset creation accelerates significantly.
Given how much AI investment across Asia-Pacific depends on the assumption of continued model improvement, what would a genuine data plateau mean for your organisation's AI roadmap? Drop your take in the comments below.