The Paradox of Plenty: Why AI's Data Appetite Is Outpacing Supply
When the skies look limitless, perhaps it's the earth that's running out. In AI development today, data scarcity represents one of the most counterintuitive challenges facing the industry. In an age of unprecedented information abundance, artificial intelligence systems are nonetheless bumping into what experts call the "data wall."
The issue isn't about total data volume. It's about finding usable, domain-specific, high-quality data that can actually improve model performance. As AI systems become more sophisticated and specialised, the gap between what exists and what's needed continues to widen.
This challenge is particularly acute across Asia, where diverse languages, regulatory frameworks, and business contexts create unique data requirements that global datasets simply can't address.
Quality Over Quantity: Redefining Data Scarcity
"Data scarcity in AI refers to the insufficient availability of high-quality training data, hindering the development of effective machine learning models and leading to reduced AI performance." , Midhat Tilawat, AI Technologist, All About AI
When AI practitioners speak of data scarcity, they rarely mean a complete absence of information. The challenge lies in accessing high-signal, representative, and legally usable data within specific domains. This manifests most acutely when building AI systems for narrow specialisations, smaller languages, or niche verticals.
Traditional machine learning has long grappled with similar tensions: the curse of dimensionality, underfitting versus overfitting, and bias-variance trade-offs. These same challenges have scaled up dramatically in modern deep learning environments.
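To see how scarcity amplifies these trade-offs, consider a minimal sketch (assuming NumPy and scikit-learn are installed; the data is synthetic and purely illustrative): with only 15 training points, a high-capacity polynomial model drives its training error toward zero yet degrades badly on held-out data, while a modest model generalises far better.

```python
# Minimal sketch: how scarce data punishes high-capacity models.
# Assumes numpy and scikit-learn; the dataset is synthetic and illustrative.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Only 15 noisy samples of a simple underlying curve: a "scarce" dataset.
X = rng.uniform(-3, 3, size=(15, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=15)

# A dense held-out grid stands in for unseen real-world inputs.
X_test = np.linspace(-3, 3, 200).reshape(-1, 1)
y_test = np.sin(X_test).ravel()

for degree in (2, 12):  # low vs. high model capacity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_err = mean_squared_error(y, model.predict(X))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

The degree-12 model memorises the 15 points; only more (or better) data, not more capacity, closes its gap on the held-out grid.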
Consider the stark reality facing developers working on Southeast Asia's AI ambitions. Local languages, cultural contexts, and regulatory requirements create data needs that generic internet scraping simply cannot fulfil.
By The Numbers
- 85% of enterprise data is unstructured, leaving it largely unusable for AI training without extensive preparation
- High-quality datasets cost 10-100x more per data point than raw scraped content
- Only 3% of publicly available text data meets enterprise AI quality standards
- Data cleaning and preparation accounts for 80% of AI project timelines
- Synthetic data generation reduces training costs by up to 60% while maintaining model performance
The economics are brutal. Models may lack sufficient training examples to generalise safely, whilst the monetary, legal, and logistical costs of collecting and cleaning data continue to escalate. The trade-off between quantity and quality has become increasingly unforgiving.
"The Internet is a vast ocean of human knowledge, but it isn't infinite, and AI researchers have nearly sucked it dry." , Nicola Jones, Journalist, Nature
The Great Data Divide: Open Versus Closed Systems
A revealing conversation at Stanford's "Imagination in Action" conference highlighted how data pipelines, ownership models, and access controls are reshaping competitive dynamics. The debate extends far beyond technical architectures to fundamental questions about who controls valuable information.
"Two years ago, there was a widely held belief that closed source models would be just so much better that there was no chance to compete." , Ari Morcos, Co-founder, Datology
This perspective has softened considerably. Success increasingly depends less on model architecture and more on sophisticated data handling: filtering, sequencing, and curation strategies. Companies with proprietary data advantages can outperform technically superior competitors simply through better information access.
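To make "sophisticated data handling" concrete, here is a deliberately simple sketch of one filtering stage of the kind curation pipelines chain together. The heuristics and thresholds below are illustrative assumptions, not anyone's production values; real systems layer many such stages alongside learned quality classifiers.

```python
# A toy document-quality filter: one small stage of a filtering/curation
# pipeline. Thresholds are illustrative assumptions, not production values.
import re

def looks_high_signal(text: str,
                      min_words: int = 50,
                      max_symbol_ratio: float = 0.1,
                      max_dup_line_ratio: float = 0.3) -> bool:
    words = text.split()
    if len(words) < min_words:                           # too short to carry signal
        return False
    symbols = len(re.findall(r"[^\w\s]", text))
    if symbols / max(len(text), 1) > max_symbol_ratio:   # markup/encoding debris
        return False
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines:
        dup_ratio = 1 - len(set(lines)) / len(lines)     # boilerplate repetition
        if dup_ratio > max_dup_line_ratio:
            return False
    return True

if __name__ == "__main__":
    docs = ["word " * 60, "<div><div><div>!!!"]  # one usable doc, one junk doc
    print([looks_high_signal(d) for d in docs])  # [True, False]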
The implications ripple across Asia's diverse markets. From healthcare systems in Singapore to manufacturing networks in Vietnam, organisations must navigate complex data governance challenges whilst building competitive AI capabilities. Understanding how AI recalibrated the value of data becomes crucial for strategic planning.
| Data Strategy | Advantages | Limitations | Best Use Cases |
|---|---|---|---|
| Public datasets | Low cost, immediate access | Generic, legal uncertainty | Proof of concept, research |
| Proprietary data | Domain-specific, competitive edge | Expensive, limited scale | Enterprise applications, niche domains |
| Synthetic generation | Scalable, privacy-safe | Quality limitations, model collapse risk | Sensitive sectors, data augmentation |
| Hybrid approach | Balanced coverage, reduced risk | Complex management, higher costs | Production systems, regulated industries |
Stretching What You Have: Synthetic Solutions and Smart Augmentation
When high-quality data proves scarce, organisations turn to techniques that maximise existing resources. Synthetic data generation and intelligent augmentation strategies offer pathways forward, though they carry distinct risks and limitations.
Synthetic data helps fill training gaps, balance skewed datasets, and enable safer development in sensitive sectors like finance and healthcare. However, it introduces the risk of model collapse, where systems essentially re-learn their own limited understanding without gaining fresh insights.
"You can only ever teach a model something that the synthetic data generating model already understood." , Ari Morcos, Co-founder, Datology
More promising approaches involve rephrasing and augmentation techniques that restructure existing information to create new training inputs. This proves both cheaper and safer than full synthetic generation, allowing companies to take internal proprietary data and reformat it at scale for AI readiness.
Key strategies include:
- Automated rephrasing to generate diverse training examples from limited source material (a minimal sketch follows this list)
- Multi-modal data fusion combining text, images, and structured information
- Domain-specific data augmentation using industry knowledge graphs
- Privacy-preserving synthetic data generation for regulated sectors
- Cross-lingual data expansion leveraging translation and localisation
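As a concrete, if simplified, illustration of the first strategy, the sketch below expands one internal record into several surface forms carrying the same label. Production pipelines would typically use an LLM or a dedicated paraphrase model for this step; the hand-written templates here are a stand-in assumption that keeps the example self-contained and runnable.

```python
# Minimal sketch of template-based rephrasing augmentation. The templates
# are a stand-in assumption for an LLM or paraphrase model.
TEMPLATES = [
    "{subject} {verb} {object}.",
    "It is {subject} that {verb} {object}.",
    "{object} is {verb_passive} by {subject}.",
]

def augment(subject: str, verb: str, verb_passive: str, obj: str) -> list[str]:
    """Expand one labelled fact into several distinct surface forms."""
    return [
        t.format(subject=subject, verb=verb,
                 verb_passive=verb_passive, object=obj)
        for t in TEMPLATES
    ]

# One proprietary record becomes three training inputs with the same label.
for sentence in augment("the claims team", "approves", "approved", "the refund"):
    print(sentence)
```

Because every variant preserves the underlying fact, this stretches scarce proprietary data without the collapse risk of fully synthetic generation.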
These approaches prove particularly valuable for Asian markets, where overcoming data hurdles requires creative solutions tailored to local contexts and constraints.
The Continuous Learning Imperative: Building Dynamic AI Systems
The future points toward continuous model evolution rather than static training cycles. This paradigm shift demands sophisticated data infrastructure capable of real-time ingestion, processing, and validation. Organisations must build systems that learn and adapt from incoming information whilst maintaining quality and security standards.
This transformation affects how businesses approach AI implementation. Rather than deploying fixed models, successful companies develop dynamic systems that improve through ongoing data interaction. The implications span across sectors, from logistics companies optimising routes to healthcare providers personalising treatment protocols.
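One minimal way to prototype such a dynamic system is incremental learning, where the model updates on each incoming batch instead of being retrained from scratch. The sketch below (assuming a recent scikit-learn; the data stream and quality gate are illustrative stand-ins) uses SGDClassifier.partial_fit to absorb new batches continuously.

```python
# Sketch of a continuously learning model: each incoming batch updates
# the classifier in place rather than triggering a full retrain.
# The data stream is synthetic; a real pipeline would insert proper
# quality and drift checks before every update.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)
classes = np.array([0, 1])
model = SGDClassifier(loss="log_loss", random_state=1)

def next_batch(n=200):
    """Stand-in for a real-time ingestion source."""
    X = rng.normal(size=(n, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y

for step in range(10):
    X, y = next_batch()
    if len(np.unique(y)) < 2:        # toy quality gate: skip degenerate batches
        continue
    model.partial_fit(X, y, classes=classes)   # incremental update, no full retrain
    X_val, y_val = next_batch(100)
    print(f"step {step}: holdout accuracy={model.score(X_val, y_val):.2f}")
```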
Asian enterprises face particular challenges in this transition. Diverse regulatory environments, varying data protection laws, and complex cross-border requirements create implementation hurdles that require careful navigation. The experience of Singapore SMEs falling behind as employees race ahead on AI illustrates these challenges in practical terms.
What exactly is AI data scarcity?
AI data scarcity refers to the shortage of high-quality, domain-specific, legally accessible training data needed for effective machine learning models. It's not about total data volume but about finding usable information that improves model performance in specific contexts.
How does synthetic data help address scarcity issues?
Synthetic data generation creates artificial training examples to fill gaps in real datasets. It enables safer development in sensitive sectors and helps balance skewed data distributions, though it carries risks like model collapse if overused without fresh real-world inputs.
Why can't companies just use more internet data?
Most internet data lacks the quality, specificity, and legal clarity needed for enterprise AI applications. Generic web scraping produces low-signal information that doesn't address domain-specific requirements or regulatory compliance needs in professional contexts.
What role does data governance play in AI development?
Effective data governance ensures quality control, legal compliance, and strategic value extraction from information assets. It becomes crucial as AI systems require continuous data feeds and must operate within complex regulatory frameworks, particularly in regulated industries.
How are Asian markets uniquely affected by data scarcity?
Asian markets face additional challenges from linguistic diversity, varying regulatory frameworks, and cultural contexts that global datasets can't adequately represent. This creates particular needs for localised data strategies and region-specific model development approaches.
The race for AI dominance increasingly hinges on data strategy rather than computational power alone. As models become commoditised, the differentiator lies in accessing, processing, and utilising information assets effectively. Asian enterprises that recognise this shift early will build sustainable competitive advantages in the AI-driven economy.
Are you treating your organisation's data as a strategic asset or merely an operational byproduct? The distinction may determine your competitive position in the years ahead. Drop your take in the comments below.