When the skies look limitless, perhaps it’s the earth that’s running out — or maybe we just haven’t learned to dig deeper.
Data scarcity in AI is less about an absolute lack of data and more about the shortage of usable, domain‑specific, high‑quality data
The core challenge is not supplying more data but leveraging existing data (including private data) more intelligently
Synthetic data, rephrasing, and distillation techniques will be essential — though they carry risks (e.g. model collapse)
The future is dynamic: models will be tuned continuously, so building robust data pipelines and governance is non-negotiable
Open vs closed model debates matter less than data access, control, and smart utilisation
Let’s start with a provocation: the biggest constraint in AI today might not be chips, compute, or architecture; it might be data. We often assume that because the internet is vast, there’s more than enough material to feed every new model. But what if, in many domains, we are already bumping into a “data wall”?
That’s precisely the conversation unfolding in expert circles. The “data scarcity” argument feels counterintuitive in an age of data abundance. Yet the nuance matters: it’s not whether data exists, but whether useful, high‑quality, domain‑specific, and legally accessible data exists. And on those fronts, cracks are appearing.
What Is Data Scarcity — and Why It Matters
When AI practitioners speak of data scarcity, they rarely mean a total absence of data. Rather, the challenge is the shortage of high‑signal, representative, legally usable data in particular domains.
“Data scarcity in AI refers to the insufficient availability of high-quality training data, hindering the development of effective machine learning models and leading to reduced AI performance.” — Midhat Tilawat, AI technologist at All About AI
Traditional machine learning has long contended with related themes: the curse of dimensionality, underfitting versus overfitting, bias and variance tradeoffs. In effect, many of the same tensions reappear in modern deep learning.
This scarcity manifests most sharply when building AI systems in narrow or specialised domains, or for small languages or niche verticals. In such cases:
- Models may lack enough training examples to generalise safely
- The cost (monetary, legal, logistical) of collecting and cleaning data escalates
- The trade‑off between “quantity” and “quality” becomes brutal
Some analysts have even declared that “the Internet is a vast ocean of human knowledge, but it isn’t infinite,” and that AI researchers have “nearly sucked it dry.” — Nicola Jones, Journalist at Nature
Open vs Closed: Where Data Lives, and Who Controls It
A revealing moment came in a panel at the Stanford “Imagination in Action” conference. Marcie Vu (Greycroft) and Ari Morcos (Datology), speaking with Julie Choi (Cerebras), spent much of their time unpicking the logistics of data pipelines, ownership, and the trade‑offs between open and closed systems.
“Two years ago … there was a widely held belief that closed source models would be just so much better … that there was no chance to compete.” — Ari Morcos, Co-founder at Datology
Morcos noted that beliefs about open source lagging behind have softened. But he stressed that, even more than architecture, it’s the way data is handled (filtered, sequenced, curated) that will separate successful systems from brittle ones.
Vu reinforced this point from the investor lens, suggesting that startups “AI-accelerated” by proprietary data strategies could outperform even technically superior competitors.
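To make Morcos’s point concrete, here is a deliberately simplified sketch of the three levers he names: filtering, sequencing, and curation. The heuristics and thresholds are our own illustrations, not Datology’s pipeline.

```python
# Illustrative only: a toy curation pass over a document corpus.
# Thresholds and heuristics are made up for the sketch.
from typing import List


def curate(documents: List[str], min_words: int = 20) -> List[str]:
    # Curate: drop exact duplicates, keeping the first occurrence.
    seen, unique = set(), []
    for doc in documents:
        key = doc.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(doc)

    # Filter: keep only documents that pass a crude quality heuristic.
    filtered = [d for d in unique if len(d.split()) >= min_words]

    # Sequence: order from shortest to longest as a stand-in for an
    # easy-to-hard curriculum.
    return sorted(filtered, key=lambda d: len(d.split()))


if __name__ == "__main__":
    docs = ["Short note.", "Short note.", " ".join(["word"] * 25)]
    print(curate(docs))  # duplicates and low-signal snippets are gone
```

A production pipeline would swap these toy heuristics for learned quality filters and deduplication at scale, but the shape is the same: decide what goes in, and in what order.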
Synthetic, Augmented, Rephrased: Stretching the Data You Already Have
If good data is scarce, the next best bet is stretching what we already have. This is where techniques like rephrasing and synthetic generation come into play.
Synthetic Data
Synthetic data can help fill gaps in training, balance skewed datasets, or enable safer training in sensitive sectors like healthcare or finance.
Yet it’s not without risk. If synthetic data is derived from model outputs, we risk model collapse, where systems re-learn their own limited understanding without fresh insight.
“You can only ever teach a model something that the synthetic data generating model already understood.” — Ari Morcos, Datology
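A toy simulation (our illustration, not something from the panel) makes that risk tangible: fit each “generation” of a simple statistical model only on samples drawn from the previous generation, with no fresh real data, and watch the learned distribution narrow.

```python
# Minimal sketch of model collapse with a one-dimensional Gaussian "model".
import numpy as np

rng = np.random.default_rng(0)

# Generation 0 is fit on "real" data with a wide spread.
real_data = rng.normal(loc=0.0, scale=5.0, size=10_000)
mu, sigma = real_data.mean(), real_data.std()

for gen in range(1, 201):
    # Each new generation is trained purely on a small synthetic sample
    # drawn from the previous generation's model.
    synthetic = rng.normal(loc=mu, scale=sigma, size=50)
    mu, sigma = synthetic.mean(), synthetic.std()
    if gen % 50 == 0:
        print(f"generation {gen:3d}: mean={mu:+.2f}, std={sigma:.2f}")

# The spread typically shrinks generation over generation: the tails of the
# original distribution are forgotten, because no model in the chain ever
# sees anything its parent did not already "understand".
```

The same dynamic, in far more complex form, is the worry when model outputs quietly become the training inputs for the next model.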
Rephrasing and Augmentation
“Rephrasing” involves restructuring existing data to yield new inputs. It’s cheaper and often safer than full synthetic generation. Morcos explained how their tooling enables companies to take internal proprietary data and reformat it at scale for AI readiness — affordably.
“Rather than that just being a synthetic dataset, you can now feed in your own data and have that be augmented and rephrased at scale, in a really effective way.” — Ari Morcos
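As a rough sketch of what rephrasing-based augmentation can look like, the code below expands an internal corpus by rewriting each document in a handful of styles. The `call_llm` parameter is a stand-in for whatever completion endpoint you actually use; none of this reflects Datology’s tooling.

```python
# Hypothetical rephrasing pipeline: `call_llm` is a placeholder, not a real API.
from typing import Callable, List

REPHRASE_STYLES = [
    "Rewrite the following passage as a question-and-answer pair.",
    "Rewrite the following passage in plain language for a non-expert.",
    "Summarise the following passage as three bullet points.",
]


def rephrase_corpus(documents: List[str], call_llm: Callable[[str], str]) -> List[str]:
    """Expand a corpus by rephrasing each document in several styles.

    Originals are kept alongside the rephrasings, so the augmented set stays
    grounded in the source data rather than in a generator's imagination.
    """
    augmented: List[str] = []
    for doc in documents:
        augmented.append(doc)  # always keep the original
        for style in REPHRASE_STYLES:
            augmented.append(call_llm(f"{style}\n\n---\n{doc}"))
    return augmented


if __name__ == "__main__":
    # Dummy model so the sketch runs end to end without any external service.
    def dummy_llm(prompt: str) -> str:
        return "[rephrased] " + prompt.splitlines()[-1][:60]

    docs = ["Invoice disputes must be raised within 30 days of issue."]
    for line in rephrase_corpus(docs, dummy_llm):
        print(line)
```

Because every output is anchored to a real source document, the model is leaned on for form rather than facts, which is part of why rephrasing is often safer than fully synthetic generation.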
What the Future Holds: Continual Learning, Governance, and Enterprise Intelligence
The panel closed with reflections on the near-term future. Vu mentioned applied robotics; Morcos pointed to a coming wave of continual fine-tuning, where models evolve in real time from incoming data.
This implies a new AI operating model:
- Real-time ingestion, processing, and validation
- Continuous updating and governance of models
- Heavy focus on data security, especially in enterprise contexts
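In code, that operating model reduces to a control loop: ingest, validate, fine-tune, evaluate, and only then promote. The skeleton below is a sketch under those assumptions; every function is a placeholder for your own pipeline, not a real framework API.

```python
# Skeleton of a continual-tuning loop; all functions are placeholders.
import time


def ingest_batch():
    # Pull new records (logs, documents, user feedback) from your sources.
    return [{"text": "example record", "source": "crm"}]


def validate(batch):
    # Governance gate: PII checks, licensing, deduplication, quality filters.
    return [record for record in batch if record["text"].strip()]


def fine_tune(model, batch):
    # Incremental update on the validated data; call your training job here.
    return model


def evaluate(model):
    # Regression tests and evals before anything reaches users.
    return 1.0


def run_once(model, quality_floor=0.9):
    batch = validate(ingest_batch())
    if not batch:
        return model
    candidate = fine_tune(model, batch)
    # Only promote the candidate if it clears the evaluation gate.
    return candidate if evaluate(candidate) >= quality_floor else model


if __name__ == "__main__":
    model = object()  # stand-in for a real model handle
    for _ in range(3):
        model = run_once(model)
        time.sleep(0.1)  # in production this would be event- or schedule-driven
```

The interesting engineering lives inside `validate` and `evaluate`: that is where the governance and security concerns above actually get enforced.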
As AI spreads across sectors in Asia — from logistics in Vietnam to healthcare in Singapore — enterprises must answer a hard question: Do we truly know how to use the data we already have?
The real AI breakthroughs won’t come from bigger models or cheaper GPUs alone. They’ll come from firms that build the culture, infrastructure, and trust to harness their own data, starting with an understanding of how AI has recalibrated the value of data. That shift also changes how businesses approach generative AI adoption and how they manage the information they already hold.
Is your organisation making the most of the data it already has? Or are you staring at a goldmine, waiting for someone else to dig? Staying competitive means adapting, much as every worker now has to answer the question: What is your non-machine premium?