- AI models like ChatGPT use vast amounts of text, often without permission.
- The New York Times has sued OpenAI for copyright infringement.
- You can protect your writing by editing your robots.txt file.
The Rise of AI and Its Hunger for Words
Artificial Intelligence (AI) is transforming the world, but it comes with challenges. AI models like ChatGPT require enormous amounts of text to train. For instance, the first version of ChatGPT was trained on about 300 billion words. That's equivalent to writing a thousand words a day for over 800,000 years!
But where does all this text come from? Often, it's scraped from the internet without permission, raising serious copyright concerns.
The Case of The New York Times vs. OpenAI
In a high-profile case, The New York Times sued OpenAI, the company behind ChatGPT, for copyright infringement. The lawsuit alleges that OpenAI scraped millions of articles from The New York Times and used them to train its AI models. Sometimes, these models even reproduce chunks of text verbatim.
"OpenAI made three hundred million in August and expects to hit $3.7 billion this year." - The New York Times
"OpenAI made three hundred million in August and expects to hit $3.7 billion this year." - The New York Times
This raises a crucial question: How would you feel if AI models were using your writing without permission?
The Looming Content Crisis
AI companies face a potential content crisis. A study by Epoch AI suggests that AI models could run out of human-generated content as early as 2026. This could lead to stagnation, as AI models need fresh content to keep improving.
"The AI field might face challenges in maintaining its current pace of progress once it drains the reserves of human-generated writing." - Tamay Besiroglu, author of the Epoch AI study
"The AI field might face challenges in maintaining its current pace of progress once it drains the reserves of human-generated writing." - Tamay Besiroglu, author of the Epoch AI study
Protecting Your Writing: The robots.txt File
So, how can you protect your writing? The solution lies in a simple text file called robots.txt. This file tells robots (including AI bots) what they can and can't access on your website.
Here's how it works:
- User-agent: the name of the robot — for example, 'GPTBot' for ChatGPT.
- Disallow: tells that robot what it may not access.
- The slash (/): means the entire website.
So, if you want to block ChatGPT from accessing your writing, you would add this to your robots.txt file:
```
User-agent: GPTBot
Disallow: /
```
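If you'd like to sanity-check that the rule behaves as intended before relying on it, here is a minimal sketch using Python's standard-library `urllib.robotparser` (the `example.com` URL is just a placeholder):

```python
import urllib.robotparser

# The two-line rule exactly as it would appear in robots.txt.
rules = """\
User-agent: GPTBot
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# GPTBot is blocked from every path; crawlers not named in the file are unaffected.
print(parser.can_fetch("GPTBot", "https://example.com/my-article"))    # False
print(parser.can_fetch("Googlebot", "https://example.com/my-article")) # True
```

Note that a robots.txt rule is a request, not an access control: well-behaved crawlers honour it, but nothing technically prevents a bot from ignoring it.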
How to Edit Your robots.txt File
If you have your own website, you can edit the robots.txt file to block AI bots.
Here's how:
- Using the Yoast SEO plugin: go to Yoast > Tools > File Editor.
- Using FTP access: the robots.txt file is in the root directory of your site.
- Using the WP Robots Txt plugin: a simple, non-technical solution. Go to Plugins > Add New, search for 'WP Robots Txt', and click Install.
Once you're in the robots.txt file, copy and paste the following to block common AI bots:
```
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /
```
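To confirm the whole set of rules at once, a short helper can report which agents a given robots.txt text blocks site-wide. This is only an illustrative sketch — the `blocked_agents` function and the `AI_BOTS` list are this example's own names, not part of any official API:

```python
import urllib.robotparser

# The six AI crawlers listed above.
AI_BOTS = ["GPTBot", "ChatGPT-User", "Google-Extended",
           "Omgilibot", "ClaudeBot", "Claude-Web"]

def blocked_agents(robots_txt: str, agents=AI_BOTS) -> list[str]:
    """Return the agents that robots_txt blocks from the whole site."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [a for a in agents if not parser.can_fetch(a, "/")]

# Build the same rules as above and confirm every bot is blocked.
rules = "\n".join(f"User-agent: {a}\nDisallow: /\n" for a in AI_BOTS)
print(blocked_agents(rules))
```

Running this against your live site's robots.txt (fetched with any HTTP client) is a quick way to verify the file was saved correctly.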
The Common Crawl Dilemma
Common Crawl is a non-profit organisation that creates a copy of the internet for research and analysis. Unfortunately, OpenAI used Common Crawl data to train its AI models. If you want to block Common Crawl, add this to your robots.txt file:
```
User-agent: CCBot
Disallow: /
```
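A common worry is that blocking crawlers will hurt search rankings. Because robots.txt rules are per-agent, blocking CCBot leaves ordinary search crawlers such as Googlebot untouched — a quick sketch (again using Python's standard-library parser, with `example.com` as a placeholder) shows this:

```python
import urllib.robotparser

# A robots.txt that blocks Common Crawl's bot and nothing else.
rules = """\
User-agent: CCBot
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("CCBot", "https://example.com/"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/")) # True — search indexing unaffected
```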
The Future of AI and Copyright Law
The future of AI and copyright law is uncertain. Until the laws change, the best way to protect your writing is to block AI bots using the robots.txt file.
"Until they change copyright laws and intellectual property laws and give the rights to he with the most money — your words are yours."
"Until they change copyright laws and intellectual property laws and give the rights to he with the most money — your words are yours."
Comment and Share:
How do you feel about AI models using your writing without permission? Have you checked your robots.txt file? Share your thoughts and experiences below. And don't forget to subscribe to our newsletter for updates on AI and AGI developments!
Latest Comments (6)
This piece brings up a super valid point. I remember when *robots.txt* started gaining traction for this sort of thing a while back; it felt like a niche tech-y detail. Nowadays, with so many folks concerned about their content being slurped up, it's definitely a practical approach. Copyright in the AI space is a real headache for creators.
Interesting read. So, if your robots.txt is perfect, is there still a chance your content feeds the bots without explicit permission? Just curious, lah.
This was a really helpful read, thanks! The robots.txt angle is clever, but I wonder about the 'fair use' defence AI companies often bring up. Could that override our wishes even with the file in place, especially for publicly available works? It's a bit worrying, lah.
Good read! The robots.txt angle is definitely a prime defence, and it's good to see that emphasised. Yet, I wonder if these AI chaps will always respect it. Like, what’s to stop them from just ignoring a disallow rule eventually? It's a proper quandary for us creatives.
This is a timely read, even now. The whole situation with AI hoovering up content has been a developing story, and it’s good to see practical advice like using `robots.txt`. I’m particularly curious about the long-term effectiveness of this approach, though. With AI models becoming more sophisticated, do we foresee a future where they simply disregard such directives, perhaps by just scraping the rendered HTML directly, bypassing the `robots.txt` entirely? Or is the legal framework, the one mentioned about copyright issues, our stronger defence? It feels like a cat-and-mouse game, doesn’t it?
This was a rather illuminating read, particularly the bits about robots.txt. However, I can't help but wonder if this is just a temporary bandage, you know? While we're all scrambling to protect our words, these AI models are evolving at quite a rapid pace. It feels like a cat and mouse game where the 'bots' are always a step ahead. Perhaps a more proactive approach, something beyond just blocking them, is needed? Focusing solely on preventing access seems a bit like closing the stable doors after the horse has bolted, especially with the sheer volume of data already out there. Copyright is one thing, but the practical implications are something else entirely.