    Running Out of Data: The Strange Problem Behind AI's Next Bottleneck

    This feature explores the growing challenge of data scarcity in artificial intelligence, revealing why high-quality, domain-specific data is becoming a bottleneck. Through insights from enterprise leaders, we look at synthetic data, proprietary pipelines, and the open vs closed model debate - all through a commercially grounded lens aimed at AI professionals across Asia.

    Anonymous · 6 min read · 16 October 2025

    [Image: AI data scarcity]

    AI Snapshot

    The TL;DR: what matters, fast.

    AI's biggest constraint today is data rather than chips, compute, or architecture: many domains are already hitting a "data wall".

    The challenge of data scarcity in AI is not a total absence of data, but rather a shortage of high-signal, representative, legally usable data in particular domains.

    Models will be tuned continuously, so robust data pipelines and governance are essential; data access and smart utilisation matter more than the open vs closed model debate.

    Who should pay attention: AI developers | Data scientists | Machine learning engineers | Researchers

    What changes next: Techniques for synthetic data generation will rapidly evolve.

    When the skies look limitless, perhaps it’s the earth that’s running out — or maybe we just haven’t learned to dig deeper.

    Data scarcity in AI is less about total paucity, more about usable, domain‑specific, high‑quality data

    The core challenge is not supplying more data but leveraging existing data (including private data) more intelligently

    Synthetic data, rephrasing, and distillation techniques will be essential — though they carry risks (e.g. model collapse)

    The future is dynamic: models will be tuned continuously, so building robust data pipelines and governance is non-negotiable

    Open vs closed model debates matter less than data access, control, and smart utilisation

    Let’s start with a provocation: the biggest constraint in AI today might not be chips, compute, or architecture; it might be data. We often assume that because the internet is vast, there’s more than enough material to feed every new model. But what if, in many domains, we are already bumping into a “data wall”?

    That’s precisely the conversation unfolding in expert circles. The “data scarcity” argument feels counterintuitive in an age of data abundance. Yet the nuance matters: it’s not whether data exists, but whether useful, high‑quality, domain‑specific, and legally accessible data exists. And on those fronts, cracks are appearing.

    What Is Data Scarcity — and Why It Matters

    When AI practitioners speak of data scarcity, they rarely mean a total absence of data. Rather, the challenge is the shortage of high‑signal, representative, legally usable data in particular domains.

    “Data scarcity in AI refers to the insufficient availability of high-quality training data, hindering the development of effective machine learning models and leading to reduced AI performance.” — Midhat Tilawat, AI technologist at All About AI

    Traditional machine learning has long contended with related themes: the curse of dimensionality, underfitting versus overfitting, bias and variance tradeoffs. In effect, many of the same tensions reappear in modern deep learning.

    This scarcity manifests most sharply when building AI systems in narrow or specialised domains, or for small languages or niche verticals. In such cases:

    Models may lack enough training examples to generalise safely

    The cost (monetary, legal, logistical) of collecting and cleaning data escalates

    The trade‑off between “quantity” and “quality” becomes brutal

    Some analysts have even declared that “the Internet is a vast ocean of human knowledge, but it isn’t infinite,” and that AI researchers have “nearly sucked it dry.” — Nicola Jones, Journalist at Nature

    Open vs Closed: Where Data Lives, and Who Controls It

    A revealing moment came in a panel at the Stanford “Imagination in Action” conference. Marcie Vu (Greycroft) and Ari Morcos (Datology), speaking with Julie Choi (Cerebras), spent much of their time unpicking the logistics of data pipelines, ownership, and the trade‑offs between open and closed systems.

    “Two years ago … there was a widely held belief that closed source models would be just so much better … that there was no chance to compete.” — Ari Morcos, Co-founder at Datology

    Morcos noted that the belief that open source would lag behind has softened. But he stressed that, even more than architecture, it’s the way data is handled (filtered, sequenced, curated) that will separate successful systems from brittle ones.

    Vu reinforced this point from the investor lens, suggesting that startups “AI-accelerated” by proprietary data strategies could outperform even technically superior competitors.

    Synthetic, Augmented, Rephrased: Stretching the Data You Already Have

    If good data is scarce, the next best bet is stretching what we already have. This is where techniques like rephrasing and synthetic generation come into play.

    Synthetic Data

    Synthetic data can help fill gaps in training, balance skewed datasets, or enable safer training in sensitive sectors like healthcare or finance.
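    To make the gap-filling idea concrete, below is a minimal Python sketch of one widely used approach: generating synthetic minority-class examples to re-balance a skewed dataset with SMOTE. It assumes scikit-learn plus the imbalanced-learn package and uses a toy dataset as a stand-in for real domain data; it illustrates the general technique, not the specific tooling discussed in the panel.

```python
# A minimal sketch: re-balancing a skewed dataset with synthetic minority samples (SMOTE).
# Assumes scikit-learn and the imbalanced-learn package; the dataset is a toy stand-in.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced dataset: roughly 95% negatives, 5% positives.
X, y = make_classification(
    n_samples=2_000, n_features=20, weights=[0.95, 0.05], random_state=42
)
print("before:", Counter(y))

# SMOTE interpolates between existing minority examples to create new, synthetic ones.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print("after: ", Counter(y_resampled))
```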

    Yet it’s not without risk. If synthetic data is derived from model outputs, we risk model collapse, where systems re-learn their own limited understanding without fresh insight.

    “You can only ever teach a model something that the synthetic data generating model already understood.” — Ari Morcos, Datology

    Rephrasing and Augmentation

    “Rephrasing” involves restructuring existing data to yield new inputs. It’s cheaper and often safer than full synthetic generation. Morcos explained how their tooling enables companies to take internal proprietary data and reformat it at scale for AI readiness — affordably.
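    As an illustration of the rephrasing idea (not Datology’s actual tooling, which the article does not detail), here is a short Python sketch that paraphrases internal documents with a general-purpose LLM API. The model name, prompt, and example document are assumptions made purely for the sketch; it presumes the openai client library and an API key in the environment.

```python
# A minimal sketch of rephrasing existing data to yield additional training examples.
# Assumes the openai Python client and an OPENAI_API_KEY in the environment; the model
# name, prompt, and sample document are illustrative choices, not from the article.
from openai import OpenAI

client = OpenAI()

def rephrase(text: str) -> str:
    """Return a paraphrase of `text` that preserves facts and terminology."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Rewrite the user's text with different wording but identical meaning."},
            {"role": "user", "content": text},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

internal_docs = ["Q3 churn rose 4% after the pricing change in the SEA region."]
# Keep the originals and add one rephrased variant per document.
augmented = internal_docs + [rephrase(doc) for doc in internal_docs]
```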

    “Rather than that just being a synthetic dataset, you can now feed in your own data and have that be augmented and rephrased at scale, in a really effective way.” — Ari Morcos

    What the Future Holds: Continual Learning, Governance, and Enterprise Intelligence

    The panel closed with reflections on the near-term future. Vu mentioned applied robotics; Morcos pointed to a coming wave of continual fine-tuning, where models evolve in real-time from incoming data.

    This implies a new AI operating model:

    Real-time ingestion, processing, and validation

    Continuous updating and governance of models

    Heavy focus on data security, especially in enterprise contexts
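    As a rough illustration of the ingestion-and-validation step in such a pipeline, the Python sketch below gates incoming records on schema validity, consent, and a simple PII check before they can reach a tuning job. It assumes the pydantic library; the field names and the email-based PII rule are illustrative assumptions, not anything specified in the article.

```python
# A minimal sketch of the ingest-validate-gate step in a continual-tuning pipeline.
# Assumes pydantic for schema checks; field names and the PII rule are illustrative.
import re

from pydantic import BaseModel, ValidationError

class TrainingRecord(BaseModel):
    source: str        # where the record came from (for audit / governance)
    text: str          # the content that may be used for tuning
    consent: bool      # whether this data is cleared for model training

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def admit(raw: dict) -> TrainingRecord | None:
    """Validate one incoming record; reject anything malformed, unconsented, or containing PII."""
    try:
        record = TrainingRecord(**raw)
    except ValidationError:
        return None
    if not record.consent or EMAIL_RE.search(record.text):
        return None
    return record

incoming = [
    {"source": "crm_export", "text": "Customer asked about bulk pricing.", "consent": True},
    {"source": "helpdesk", "text": "Reach me at jane@example.com", "consent": True},
]
# Only the first record survives: the second contains an email address.
clean_batch = [r for r in (admit(raw) for raw in incoming) if r is not None]
```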

    As AI spreads across sectors in Asia — from logistics in Vietnam to healthcare in Singapore — enterprises must answer a hard question: Do we truly know how to use the data we already have?

    The real AI breakthroughs won’t come just from bigger models or cheaper GPUs. They’ll come from firms that build the culture, infrastructure, and trust to harness their own data. Understanding how AI has recalibrated the value of data is crucial, and that shift also changes how businesses approach generative AI adoption and how they manage the information they already hold.

    Is your organisation making the most of the data it already has? Or are you staring at a goldmine and waiting for someone else to dig? To stay competitive, businesses need to adapt, much as every worker needs to answer the question: what is your non-machine premium?

    Latest Comments (2)

    Julien Simon (@julien_s_ai) · 25 October 2025

    Ah, this is a topic that resonates deeply, particularly here in France where we're often preoccupied with data sovereignty and the nuances of intellectual property. The article touches on proprietary pipelines, and it makes me wonder: how feasible is it, truly, for *every* company to build out these sophisticated data engines? I can see the behemoths doing it, certainly. But for the smaller enterprises, the *PME* as we call them, it feels like this could create a massive chasm. Are we heading towards a future where only the data-rich can innovate with advanced AI, or are there truly accessible, shared solutions on the horizon that don't compromise competitive advantage? It’s a very practical concern, this data scarcity.

    Stanley Yap (@stanleyY) · 21 October 2025

    This article really zeroes in on a pressing issue. I'm curious, for companies building proprietary pipelines here in Southeast Asia, are we seeing more collaboration between businesses to pool domain specific datasets, or is everyone still largely trying to go it alone?
