
    Protect Your Writing from AI Bots: A Simple Guide

    This article explains how to protect your writing from AI bots using the robots.txt file, and discusses the copyright issues surrounding AI models.

    Anonymous
4 min read · 5 November 2024

    AI Snapshot

    The TL;DR: what matters, fast.

    AI models require vast amounts of text for training, often scraping content without permission, as highlighted by The New York Times' lawsuit against OpenAI.

    AI companies face a potential content crisis as human-generated text could run out by 2026, hindering further AI development.

    Protect your writing by using the robots.txt file to block AI bots and web crawlers like Common Crawl from accessing your website's content.

    Who should pay attention: Writers | Publishers | AI developers | Copyright lawyers

    What changes next: The legal landscape around AI training data and copyright is likely to evolve rapidly.

AI models like ChatGPT use vast amounts of text, often without permission.
The New York Times has sued OpenAI for copyright infringement.
You can protect your writing by editing your robots.txt file.

    The Rise of AI and Its Hunger for Words

    Artificial Intelligence (AI) is transforming the world, but it comes with challenges. AI models like ChatGPT require enormous amounts of text to train. For instance, the first version of ChatGPT was trained on about 300 billion words. That's equivalent to writing a thousand words a day for over 800,000 years!

    But where does all this text come from? Often, it's scraped from the internet without permission, raising serious copyright concerns.

    The Case of The New York Times vs. OpenAI

    In a high-profile case, The New York Times sued OpenAI, the company behind ChatGPT, for copyright infringement. The lawsuit alleges that OpenAI scraped millions of articles from The New York Times and used them to train its AI models. Sometimes, these models even reproduce chunks of text verbatim.

    "OpenAI made three hundred million in August and expects to hit $3.7 billion this year." - The New York Times

    "OpenAI made three hundred million in August and expects to hit $3.7 billion this year." - The New York Times

    This raises a crucial question: How would you feel if AI models were using your writing without permission?

    The Looming Content Crisis

    AI companies face a potential content crisis. A study by Epoch AI suggests that AI models could run out of human-generated content as early as 2026. This could lead to stagnation, as AI models need fresh content to keep improving.

    "The AI field might face challenges in maintaining its current pace of progress once it drains the reserves of human-generated writing." - Tamay Besiroglu, author of the Epoch AI study

    "The AI field might face challenges in maintaining its current pace of progress once it drains the reserves of human-generated writing." - Tamay Besiroglu, author of the Epoch AI study

    Protecting Your Writing: The robots.txt File

    So, how can you protect your writing? The solution lies in a simple text file called robots.txt. This file tells robots (including AI bots) what they can and can't access on your website.

    Here's how it works:


User-agent: The name of the robot. For example, 'GPTBot' for ChatGPT.

Disallow: The paths the robot must not access.

The slash (/): On its own, this means the entire website.

    So, if you want to block ChatGPT from accessing your writing, you would add this to your robots.txt file:

User-agent: GPTBot
Disallow: /
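You can check what a rule like this actually does before deploying it. Here is a minimal sketch using Python's standard-library robots.txt parser; the URLs are placeholders, not real sites:

```python
from urllib.robotparser import RobotFileParser

# The rule from this article, as a string, for illustration.
rules = """\
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# GPTBot is blocked from every path; bots with no matching rule are unaffected.
print(parser.can_fetch("GPTBot", "https://example.com/my-article"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/my-article"))  # True
```

Note that this only tells you what a well-behaved crawler should do; robots.txt is a convention, not an enforcement mechanism.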

    How to Edit Your robots.txt File

    If you have your own website, you can edit the robots.txt file to block AI bots.

    Here's how:

Using the Yoast SEO plugin: Go to Yoast > Tools > File Editor.

Using FTP access: The robots.txt file is in the root directory.

Using the WP Robots Txt plugin: This is a simple, non-technical solution. Just go to Plugins > Add New, then type in 'WP Robots Txt' and click install.

    Once you're in the robots.txt file, copy and paste the following to block common AI bots:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

    The Common Crawl Dilemma

    Common Crawl is a non-profit organisation that creates a copy of the internet for research and analysis. Unfortunately, OpenAI used Common Crawl data to train its AI models. If you want to block Common Crawl, add this to your robots.txt file:

User-agent: CCBot
Disallow: /
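If you prefer to script the edit rather than paste by hand, the rules above can be appended in one step. This is a minimal sketch, not an official tool; the bot list is the one from this article, and the file path is whatever your site uses:

```python
from pathlib import Path

# AI bots named in this article; "Disallow: /" blocks each from the whole site.
AI_BOTS = ["GPTBot", "ChatGPT-User", "Google-Extended",
           "Omgilibot", "ClaudeBot", "Claude-Web", "CCBot"]

def block_ai_bots(robots_path):
    """Append a blocking rule for each listed bot not already in robots.txt."""
    path = Path(robots_path)
    existing = path.read_text() if path.exists() else ""
    new_rules = [f"User-agent: {bot}\nDisallow: /\n"
                 for bot in AI_BOTS
                 if f"User-agent: {bot}" not in existing]
    if new_rules:
        if existing and not existing.endswith("\n"):
            existing += "\n"
        path.write_text(existing + "\n".join(new_rules))
```

Running it a second time adds nothing, so it is safe to re-run after your bot list changes.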

    The Future of AI and Copyright Law

    The future of AI and copyright law is uncertain. Until the laws change, the best way to protect your writing is to block AI bots using the robots.txt file.

    "Until they change copyright laws and intellectual property laws and give the rights to he with the most money — your words are yours."

    "Until they change copyright laws and intellectual property laws and give the rights to he with the most money — your words are yours."

    Comment and Share:

    How do you feel about AI models using your writing without permission? Have you checked your robots.txt file? Share your thoughts and experiences below. And don't forget to Subscribe to our newsletter for updates on AI and AGI developments!

