Strategies for Protecting Your Website Content from AI Scraping
By Paul Gerbino
The media landscape has been going through a seismic shift since the launch of the World Wide Web, and nothing has been more seismic than the explosive growth of Generative Artificial Intelligence (GenAI) and Large Language Models (LLMs). These powerful technologies can create human-like text, images, and even audio, and they are all built on one fundamental raw material: your content. They require immense quantities of that content, both for training and for Retrieval-Augmented Generation (RAG). As easily accessed content, what we at Creative Licensing International (CLI) call “open web access” and what most call “free on the web,” starts to run dry (AI developers have already scraped most of it), those developers are increasingly looking toward content behind paywalls or registration. This aggressive pursuit of training data, often involving scraping of content behind the firewall, has ignited a fierce debate about intellectual property (IP) rights, fair use, and the very future of the creator economy.
Every piece of content, from a blog post to a photograph, is subject to copyright, a legal framework designed to protect the rights of content creators. As you have read in our past articles, the U.S. Copyright Office has identified several critical areas where these rights are being challenged by the advent of GenAI.
Copyright Infringement vs. Fair Use in AI Training
Content scraping and reproduction are at the heart of the current legal battles between AI companies and content creators. Creators such as writer and comedian Sarah Silverman are taking legal action against AI developers, claiming unauthorized extraction of their works for model training. Whether the material is text, images, or audio, Silverman argues that this taking constitutes a clear infringement of her exclusive rights to reproduce and publicly display her creations, and therefore a direct violation of her copyright.
However, AI developers frequently invoke the fair-use doctrine as their primary defense. Fair use, which we have covered in past articles, is a legal principle that permits limited use of copyrighted material without permission for purposes such as criticism, comment, news reporting, teaching, scholarship, or research. The core argument is that training an AI model, much like a human learning from various sources, is a transformative act that falls within these permissible boundaries.
The courts, however, are grappling with the unprecedented complexities of applying fair use in the context of AI. A pivotal factor in these deliberations is whether the AI’s output directly competes with or replaces the original copyrighted work. Recent rulings, such as the Thomson Reuters v. Ross Intelligence case, have sent a clear message: the commercial use of copyrighted material for AI training that results in a directly competing product is unlikely to qualify as fair use. This suggests a growing judicial skepticism towards blanket fair use claims by AI developers.
With all that said, we are often asked what publishers can do to protect themselves from this taking of content. Let’s look at some practical steps website owners can take to protect their content.
Proactive, Not Reactive
Given the evolving landscape, copyright owners, particularly those with online publications, should adopt proactive measures to protect their content. While complete immunity is difficult to achieve, several easy-to-implement steps can significantly strengthen your defenses:
- Update Your robots.txt file: This is your first line of defense. Create or modify your robots.txt file in your website’s root directory (www.yourwebsite.com/robots.txt). Explicitly disallow known AI crawlers like GPTBot, Google-Extended, CCBot, ClaudeBot, and Facebot from accessing your content. While not legally binding, many reputable AI companies respect these directives.
  To list the right bots, you need to know which bots are actually scraping your content; monitoring your traffic, covered later in this list, helps you build that picture.
- Add Clear Terms of Service (ToS) and Copyright Notices: Legally declare that scraping and AI training are prohibited. Create a dedicated “Terms of Service” page with explicit clauses forbidding “Automated scraping, data mining, extraction, and the use of content for training artificial intelligence models without express written permission.” Additionally, place a prominent copyright notice (e.g., © [Current Year] Your Website Name. All Rights Reserved.) in your website’s footer.
- Leverage Technology: Services such as ScalePost or Cloudflare offer robust bot-management features that can identify and challenge automated traffic. Enable Cloudflare’s “Under Attack Mode” during suspected heavy scraping, and use firewall rules to block or challenge traffic based on suspicious User-Agents or behavior.
- Implement Basic Rate Limiting: Prevent a single IP address from making an excessive number of requests in a short period. Cloudflare offers basic rate limiting, and some web hosts provide similar options in their control panels. This makes it harder for bots to rapidly scrape large volumes of content.
- Monitor Your Site Traffic Regularly: Keep a close eye on your website analytics (e.g., Google Analytics). Look for unusual spikes in traffic from single IPs, very short visit durations combined with high page views, or strange navigation patterns, all of which can signal bot activity.
- Use Simple Copyright Notices on All Content: A subtle visual watermark on images with your logo or website name, and a small line of text at the end of articles (“Original content on [Your Website Name]. Do not copy for AI training.”), can serve as a constant, visible reminder of your ownership.
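To make the robots.txt step above concrete, here is a sketch of what such a file might look like. The crawler names are the ones listed earlier in this article; bot names change over time, so verify each AI vendor’s current documentation before relying on this list:

```
# robots.txt — placed at www.yourwebsite.com/robots.txt
# Disallows known AI training crawlers site-wide.
# (Illustrative list only; names and vendors change.)

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Facebot
Disallow: /
```

Remember that these directives are voluntary: reputable crawlers honor them, but a robots.txt file cannot technically prevent access, which is why the legal and firewall measures above matter as well.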
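The monitoring advice above can also be partially automated. The following is a minimal, illustrative sketch (not a production tool) that scans a web server access log for two of the signals described: an unusual number of requests from a single IP, and User-Agents matching known AI crawlers. The log-format regular expression, bot list, and threshold are assumptions you would adapt to your own server.

```python
# Sketch: flag possible AI-scraper activity in an access log.
# Assumes the common "combined" log format; adjust the regex
# and AI_BOTS list for your own server and the bots you see.
import re
from collections import Counter

# Illustrative, not exhaustive, list of AI crawler User-Agent substrings
AI_BOTS = ["GPTBot", "Google-Extended", "CCBot", "ClaudeBot", "Facebot"]

# Matches: IP ... "request" status size "referer" "user-agent"
LOG_LINE = re.compile(r'^(\S+) .*"[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

def analyze(log_lines, request_threshold=100):
    """Return (heavy_hitter_ips, ai_bot_hits) for raw log lines."""
    per_ip = Counter()
    bot_hits = []
    for line in log_lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that don't fit the assumed format
        ip, user_agent = m.groups()
        per_ip[ip] += 1
        if any(bot in user_agent for bot in AI_BOTS):
            bot_hits.append((ip, user_agent))
    heavy = [ip for ip, n in per_ip.items() if n > request_threshold]
    return heavy, bot_hits
```

A script like this only surfaces candidates; you would still review the flagged IPs and User-Agents before adding firewall rules or extending your robots.txt list.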
The fight for your content in the age of AI is just beginning. As models grow more capable and demand ever more data, copyright owners must remain vigilant, adapting their legal and technical strategies to protect their creative works and ensure their continued livelihood in an increasingly AI-driven world. Proactivity, clear legal declarations, and smart technological defenses are the cornerstones of safeguarding your digital assets.
About Paul Gerbino
Paul Gerbino is the President of Creative Licensing International. He is an expert in digital content strategy, licensing, product development, advertising, and copyright protection, with an exemplary track record of transforming low-performing projects into highly profitable revenue streams, evident in the innovative digital media products and advertising programs he has created and launched for B2B, B2C, STM, and academic publishers. Paul is passionate about helping publishers improve their performance, productivity, and profitability in the evolving digital landscape.