
AI in ASIA

Running Out of Data: The Strange Problem Behind AI's Next Bottleneck

This feature explores the growing challenge of data scarcity in artificial intelligence, revealing why high-quality, domain-specific data is becoming a bottleneck. Through insights from enterprise leaders, we look at synthetic data, proprietary pipelines, and the open vs closed model debate, all through a commercially grounded lens aimed at AI professionals across Asia.

Anonymous · 6 min read

AI Snapshot

The TL;DR: what matters, fast.

AI's biggest constraint is data, not chips, compute, or architecture: many domains are already hitting a “data wall”.

The challenge of data scarcity in AI is not a total absence of data, but rather a shortage of high-signal, representative, legally usable data in particular domains.

Models will be tuned continuously, so robust data pipelines and governance are non-negotiable; data access and smart utilisation matter more than the open vs closed model debate.

Who should pay attention: AI developers | Data scientists | Machine learning engineers | Researchers

What changes next: Techniques for synthetic data generation will rapidly evolve.

When the skies look limitless, perhaps it’s the earth that’s running out — or maybe we just haven’t learned to dig deeper.

Data scarcity in AI is less about total paucity, more about usable, domain‑specific, high‑quality data

The core challenge is not supplying more data but leveraging existing data (including private data) more intelligently

Synthetic data, rephrasing, and distillation techniques will be essential — though they carry risks (e.g. model collapse)

The future is dynamic: models will be tuned continuously, so building robust data pipelines and governance is non-negotiable

Open vs closed model debates matter less than data access, control, and smart utilisation

Let’s start with a provocation: the biggest constraint in AI today might not be chips, compute, or architecture; it might be data. We often assume that because the internet is vast, there’s more than enough material to feed every new model. But what if, in many domains, we are already bumping into a “data wall”?

That’s precisely the conversation unfolding in expert circles. The “data scarcity” argument feels counterintuitive in an age of data abundance. Yet the nuance matters: it’s not whether data exists, but whether useful, high‑quality, domain‑specific, and legally accessible data exists. And on those fronts, cracks are appearing.

What Is Data Scarcity — and Why It Matters

When AI practitioners speak of data scarcity, they rarely mean a total absence of data. Rather, the challenge is the shortage of high‑signal, representative, legally usable data in particular domains.

“Data scarcity in AI refers to the insufficient availability of high-quality training data, hindering the development of effective machine learning models and leading to reduced AI performance.” — Midhat Tilawat, AI technologist at All About AI

Traditional machine learning has long contended with related themes: the curse of dimensionality, underfitting versus overfitting, bias and variance tradeoffs. In effect, many of the same tensions reappear in modern deep learning.

This scarcity manifests most sharply when building AI systems in narrow or specialised domains, or for small languages or niche verticals. In such cases:

Models may lack enough training examples to generalise safely
The cost (monetary, legal, logistical) of collecting and cleaning data escalates
The trade-off between “quantity” and “quality” becomes brutal
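To make the first point concrete, here is a minimal, self-contained sketch of how training on very few examples hurts generalisation. Everything in it is hypothetical for illustration: a toy two-class dataset and a simple nearest-centroid classifier, not any system mentioned in this piece.

```python
import random

random.seed(0)

def make_point(label):
    # Toy data: class 0 centred at (0, 0), class 1 at (3, 3), unit Gaussian noise
    c = 0.0 if label == 0 else 3.0
    return (random.gauss(c, 1.0), random.gauss(c, 1.0), label)

def centroid(points):
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def accuracy(n_train):
    """Fit a nearest-centroid classifier on n_train points, score on 400 fresh ones."""
    train = [make_point(i % 2) for i in range(n_train)]
    test = [make_point(i % 2) for i in range(400)]
    c0 = centroid([p for p in train if p[2] == 0])
    c1 = centroid([p for p in train if p[2] == 1])
    correct = 0
    for x, y, label in test:
        d0 = (x - c0[0]) ** 2 + (y - c0[1]) ** 2
        d1 = (x - c1[0]) ** 2 + (y - c1[1]) ** 2
        correct += (0 if d0 < d1 else 1) == label
    return correct / len(test)

print("4 training examples:", accuracy(4))
print("200 training examples:", accuracy(200))
```

With only a couple of examples per class, the estimated class centres are noisy and held-out accuracy is unstable; with plenty of data the same simple model is reliably accurate. The same pressure, at far larger scale, is what makes scarce domain data so costly.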

Some analysts have even declared that “the Internet is a vast ocean of human knowledge, but it isn’t infinite,” and that AI researchers have “nearly sucked it dry.” — Nicola Jones, Journalist at Nature

Open vs Closed: Where Data Lives, and Who Controls It

A revealing moment came in a panel at the Stanford “Imagination in Action” conference. Marcie Vu (Greycroft) and Ari Morcos (Datology), speaking with Julie Choi (Cerebras), spent much of their time unpicking the logistics of data pipelines, ownership, and the trade‑offs between open and closed systems.

“Two years ago … there was a widely held belief that closed source models would be just so much better … that there was no chance to compete.” — Ari Morcos, Co-founder at Datology

Morcos noted that beliefs about open source lagging behind have softened. But he stressed that, even more than architecture, it is the way data is handled (filtered, sequenced, curated) that will separate successful systems from brittle ones.

Vu reinforced this point from the investor lens, suggesting that startups “AI-accelerated” by proprietary data strategies could outperform even technically superior competitors.

Synthetic, Augmented, Rephrased: Stretching the Data You Already Have

If good data is scarce, the next best bet is stretching what we already have. This is where techniques like rephrasing and synthetic generation come into play.

Synthetic Data

Synthetic data can help fill gaps in training, balance skewed datasets, or enable safer training in sensitive sectors like healthcare or finance.

Yet it’s not without risk. If synthetic data is derived from model outputs, we risk model collapse, where systems re-learn their own limited understanding without fresh insight.

“You can only ever teach a model something that the synthetic data generating model already understood.” — Ari Morcos, Datology
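The collapse risk can be illustrated with a deliberately simple simulation (entirely illustrative, not Datology's method or any real training loop): each “generation” fits a token-frequency model to the previous generation's output, then samples a fresh dataset only from that model. With no new real data entering the loop, diversity drains away:

```python
import random
from collections import Counter

random.seed(0)

# "Real" data: 20 samples spread evenly over five distinct tokens
data = list("ABCDE") * 4

for generation in range(500):
    # "Train": fit a frequency model to the current dataset
    counts = Counter(data)
    tokens, weights = zip(*counts.items())
    # "Generate": the next dataset comes only from the fitted model
    data = random.choices(tokens, weights=weights, k=len(data))

# After many self-training rounds the dataset has lost its variety
print(sorted(set(data)))
```

Sampling drift compounds generation after generation until a single token dominates, echoing Morcos' point: the synthetic pipeline can never re-introduce signal the generating model did not already contain.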

Rephrasing and Augmentation

“Rephrasing” involves restructuring existing data to yield new inputs. It’s cheaper and often safer than full synthetic generation. Morcos explained how their tooling enables companies to take internal proprietary data and reformat it at scale for AI readiness — affordably.

“Rather than that just being a synthetic dataset, you can now feed in your own data and have that be augmented and rephrased at scale, in a really effective way.” — Ari Morcos
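As a toy illustration of the rephrasing idea (the phrase table and helper below are invented for this sketch, not Datology's tooling), one proprietary record can be multiplied into several training variants by substituting interchangeable phrasings:

```python
import itertools

# Hypothetical phrase table mapping a word to acceptable alternatives
SYNONYMS = {
    "invoice": ["invoice", "bill"],
    "overdue": ["overdue", "past due", "late"],
}

def rephrase(text):
    """Yield every variant of `text` produced by synonym substitution."""
    options = [SYNONYMS.get(word, [word]) for word in text.split()]
    for combo in itertools.product(*options):
        yield " ".join(combo)

variants = list(rephrase("the invoice is overdue"))
print(len(variants))  # 2 * 3 = 6 variants from one record
```

Real rephrasing pipelines use language models rather than lookup tables, but the economics are the same: each existing record yields several grounded variants, which is cheaper and safer than generating data from nothing.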

What the Future Holds: Continual Learning, Governance, and Enterprise Intelligence

The panel closed with reflections on the near-term future. Vu mentioned applied robotics; Morcos pointed to a coming wave of continual fine-tuning, where models evolve in real-time from incoming data.

This implies a new AI operating model:

Real-time ingestion, processing, and validation
Continuous updating and governance of models
A heavy focus on data security, especially in enterprise contexts
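A minimal sketch of the validation gate in such a pipeline, where incoming records are checked before they ever reach a tuning buffer. The schema, labels, and limits here are all invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Record:
    text: str
    label: str

# Hypothetical label schema for a sentiment-tuning pipeline
ALLOWED_LABELS = {"positive", "negative", "neutral"}

def validate(record):
    """Gate incoming data before it reaches the fine-tuning buffer."""
    if not record.text.strip():
        return False          # reject empty payloads
    if record.label not in ALLOWED_LABELS:
        return False          # reject schema drift
    if len(record.text) > 10_000:
        return False          # reject oversized inputs
    return True

incoming = [
    Record("great service", "positive"),
    Record("", "negative"),       # rejected: empty
    Record("meh", "unknown"),     # rejected: bad label
]
buffer = [r for r in incoming if validate(r)]
print(len(buffer))  # 1
```

In a continual-tuning setting this gate runs on every ingested batch; the governance question is less the code itself than who owns the schema and audits the rejections.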

As AI spreads across sectors in Asia — from logistics in Vietnam to healthcare in Singapore — enterprises must answer a hard question: Do we truly know how to use the data we already have?

The real AI breakthroughs won’t come from just bigger models or cheaper GPUs. They’ll come from firms that build the culture, infrastructure, and trust to harness their own data. Understanding how AI has recalibrated the value of data is crucial here, and the same shift shapes how businesses approach generative AI adoption and manage their existing information.

Is your organisation making the most of the data it already has? Or are you staring at a goldmine and waiting for someone else to dig? To stay competitive, businesses need to adapt, starting with the question every worker must answer: what is your non-machine premium?


Latest Comments (2)

Priya Ramasamy (@priyaram), 13 November 2025

"legally usable data" is a big one for us in Malaysia. with all the data residency laws and internal compliance, getting access to some public datasets, let alone proprietary ones from partners, is a nightmare. it's not just about quality, it's about what we're even allowed to touch here.

Miguel Santos (@migssantos), 31 October 2025

"insufficient availability of high-quality training data" - that's what Midhat said. For us building AI tools for BPOs, getting clean, labeled data for specific call center scenarios is a nightmare. Everyone talks about "data abundance" but try finding a good dataset for Tagalog-English sentiment analysis that isn't full of noise or irrelevant slang. It's a real bottleneck.
