AI: “If You Don’t Want to Join ’em, Here’s How to Beat ’em.”

Website publishers have been in a panic about how to stop their content from being used to train generative AI technologies. In previous articles, my attitude toward AI has been that if you can’t beat ’em, join ’em. It still is. But just to prove that I can see the other side, here’s how to protect the content on your website from both a legal and a technical perspective.

For those of you who know me, you know that I hate reading articles in which the author conveys big ideas that don’t help me on a practical level. So this article will give you concrete, specific, proactive legal and technical steps you can take to protect the intellectual property on your website.

OK, so we all know that no one ever reads them, but the language of the terms and conditions on your website will determine whether or not you are legally protected from unwarranted use of your content. Drafting terms and conditions to prevent AI from legally using the content on a website for machine learning purposes requires precise legal language. While it’s challenging to entirely prevent AI from accessing and using website content, there are several provisions that can help mitigate the risk. 

Steps to Mitigate Risk

  • First and foremost, you must emphatically assert, in bold print, your copyright and intellectual property rights over the content on your website. Make it simple and unambiguous. 
  • State that all content displayed on your website, including text, images, videos and other materials is protected by copyright and that any use, reproduction, or distribution without your express written consent is unauthorized and prohibited.
  • Make your legal position even more detailed by including provisions that expressly prohibit the scraping, crawling or automated harvesting of content from your website by AI or any other automated means.
  • Specify that access to the website is granted for personal, non-commercial use only and that any use for commercial or automated purposes is strictly prohibited.
  • Don’t be afraid of redundancy. Repetition helps ensure that your terms and conditions are unambiguous.
  • Enumerate specific activities that are not permitted without express written consent, including but not limited to data mining, data scraping, data extraction or any other form of automated data collection or analysis.

Additional Suggestions 

  • Require users to agree to the terms and conditions before accessing or using the website. 
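  • Implement a clickwrap agreement that requires users to affirmatively consent to the terms, including the restrictions on AI use. A browsewrap notice, which merely posts the terms and treats continued browsing as acceptance, does not obtain affirmative consent and is generally harder to enforce.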
  • Ask users to indemnify and hold harmless the website owner from any claims, damages or liabilities arising out of their unauthorized use of the website, including AI-based scraping or data mining activities.
  • Reserve the right to update, modify or revise the terms and conditions at any time and notify users of any changes. 

State that continued use of the website after changes are made constitutes acceptance of the updated terms. You should also reserve the right to monitor and enforce compliance with the terms and conditions, including taking legal action against violators, pursuing damages for copyright infringement or seeking injunctive relief to prevent further unauthorized use.

It’s important to consult with legal counsel experienced in internet law and intellectual property to ensure that your terms and conditions are drafted effectively and are enforceable. Additionally, while these provisions can help deter unauthorized AI use, complete prevention is not possible, and proactive monitoring and enforcement are necessary to protect your rights.

Please note that no matter how precise the language of your terms and conditions is, I don’t believe that any language will prevent the unauthorized use of your content by AI technologies. It will, however, put you in a stronger position if you choose to pursue legal action against a violator. But if you think you’re going to beat Google in court, I hope your lawyers are smarter than me. 

So, in addition to beefed-up terms and conditions, you have to consider what you can do from a technical perspective to prevent the unauthorized use of your content. I’m a lawyer, not a computer engineer, but I’ve been in this business long enough to lay out what you should do at a high level.

Preventing AI technologies from taking content from your website for ingestion into machine learning systems can be challenging, as determined individuals or organizations consistently and diligently find ways to bypass technical obstacles. However, there are several barriers that you can erect to make it more difficult for AI to access and scrape content from your website.

We’ll start with the easiest and least expensive measures to deter automated access to your website by bots or AI systems.

  1. Rate limiting
  2. CAPTCHA challenges
  3. A robots.txt file
  4. IP blocking

Rate limiting puts a cap on how often someone can repeat an action within a certain timeframe. The technique limits network traffic so that no single user can exhaust system resources. It makes it harder for malicious actors to overburden the system with attacks like Denial of Service (DoS), in which attackers flood a target system with requests and consume too much network capacity, storage and memory. The same cap will help prevent excessive scraping or crawling activities.
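For the more technical reader, here is a minimal sketch of how a cap like this could work. It assumes a "sliding window" approach, which is only one of several ways to rate limit, and the class name, the default of 60 requests per 60 seconds and the idea of keying on a client identifier (such as an IP address) are all illustrative assumptions rather than recommendations.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most `max_requests` per client within `window_seconds`.

    Hypothetical helper for illustration; the defaults are placeholders.
    """

    def __init__(self, max_requests=60, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without waiting
        self._hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def allow(self, client_id):
        """Return True if this request is within the cap, False otherwise."""
        now = self.clock()
        hits = self._hits[client_id]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the cap; a web server would typically answer HTTP 429
        hits.append(now)
        return True
```

A web application would call `allow()` once per incoming request and refuse the request when it returns False. In practice, most sites would turn on the rate limiting already built into their web server, CDN or hosting platform rather than write their own.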

Implement CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) challenges on sensitive pages or actions to verify that users are human. A common example is the grid of images where you’re asked to select the ones containing a specific object, like traffic lights, bicycles or storefronts. CAPTCHA can help deter automated bots from accessing your website, but it is not foolproof.

Create and properly configure a robots.txt file to instruct web crawlers which pages of your website they are allowed or not allowed to access. Compliance with robots.txt is voluntary, so this won’t stop determined actors, but it can deter some automated scrapers.
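As an illustration, the sketch below holds a sample robots.txt as text and uses Python’s standard `urllib.robotparser` module to check what it permits. The crawler names shown (GPTBot, CCBot and Google-Extended) are tokens their operators have publicly documented for AI-related crawling, but the list is an assumption on my part that it suits your site; it is not exhaustive and it changes over time, so check each operator’s current documentation. The `is_allowed` helper and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content. The crawler list is illustrative, not complete.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Report whether a well-behaved crawler with this name may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Calling `is_allowed("GPTBot", "https://example.com/articles/1")` shows how a compliant crawler would read the file. It says nothing about crawlers that choose to ignore it.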

IP Blocking identifies and blocks IP addresses associated with suspicious or abusive behavior, such as repeated scraping attempts. Be cautious with IP blocking, as it can potentially affect legitimate users sharing the same IP address.
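A minimal sketch of the check behind an IP blocklist, using Python’s standard `ipaddress` module, might look like the following. The addresses are placeholders drawn from ranges reserved for documentation, and the `is_blocked` name is hypothetical. Treating an unparseable address as "not blocked" is an assumption; some sites would prefer to reject it.

```python
import ipaddress

# Placeholder entries from ranges reserved for documentation (RFC 5737).
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),    # an entire range
    ipaddress.ip_network("198.51.100.42/32"),  # a single address
]


def is_blocked(ip_string, blocked=BLOCKED_NETWORKS):
    """Return True if the address falls inside any blocked network."""
    try:
        address = ipaddress.ip_address(ip_string)
    except ValueError:
        return False  # not a valid IP address; left for other defenses to handle
    return any(address in network for network in blocked)
```

Blocking a whole range, as in the first entry, is exactly where the caution above applies: the wider the range, the more likely it is that legitimate visitors are caught in it.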

I’d also recommend some of the more complex and costly methods. 

Consider implementing a Content Security Policy (CSP) to control which external resources can be loaded by your website. CSP can help prevent malicious scripts injected into your pages from accessing and exfiltrating your content, although it does nothing to stop a scraper that simply requests your pages.
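A CSP is delivered as an HTTP response header named `Content-Security-Policy`. The sketch below assembles one from a dictionary of directives. The policy values are an assumption chosen for a simple site that loads everything from its own domain; a site that relies on third-party scripts, fonts or analytics would need additional sources. The `build_csp_header` helper is hypothetical.

```python
def build_csp_header(directives):
    """Join directive names and their allowed sources into a CSP header value."""
    parts = []
    for name, sources in directives.items():
        parts.append(name + " " + " ".join(sources))
    return "; ".join(parts)


# Illustrative policy: only load resources from the site's own origin.
POLICY = {
    "default-src": ["'self'"],
    "script-src": ["'self'"],
    "img-src": ["'self'", "data:"],
    "frame-ancestors": ["'none'"],  # do not let other sites embed these pages
}

# The header a web server or framework would attach to each response.
RESPONSE_HEADERS = {"Content-Security-Policy": build_csp_header(POLICY)}
```

Most web servers and frameworks let you set this header in configuration, so a helper like this is only needed when the policy is generated by code.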

Use HTTPS encryption to secure communications between your website and visitors’ browsers. This can help protect against man-in-the-middle attacks and unauthorized interception of data.

Load content dynamically via client-side JavaScript after the initial page load. This can make it more difficult for scrapers to extract content programmatically.

Research anti-scraping services or tools that specialize in detecting and mitigating scraping activities. These services often employ advanced techniques such as fingerprinting and behavior analysis to identify and block scrapers.

Monitor and filter incoming requests based on their user-agent strings, the text a browser or bot sends to identify itself. By filtering out suspicious user-agent strings, you can block known bots. Be aware that some scrapers mask their identity by pretending to be legitimate web browsers, so this filter will not catch all of them.
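A user-agent filter can be as simple as the sketch below. The token list is an illustrative assumption that mixes AI crawler names with the default identifiers of common scraping tools, and the `is_suspicious_user_agent` name is hypothetical. Treating a missing user-agent as suspicious is also an assumption; it may flag some legitimate tools.

```python
# Lowercase fragments to look for; illustrative only, not a complete list.
BLOCKED_AGENT_TOKENS = ("gptbot", "ccbot", "python-requests", "scrapy", "curl")


def is_suspicious_user_agent(user_agent):
    """Return True if the user-agent is missing or contains a blocked token."""
    if not user_agent or not user_agent.strip():
        return True  # real browsers always send a user-agent
    lowered = user_agent.lower()
    return any(token in lowered for token in BLOCKED_AGENT_TOKENS)
```

A request with the user-agent `python-requests/2.31.0` would be flagged, while an ordinary browser string would pass. A scraper that copies a browser’s user-agent would pass as well, which is why this works best alongside the other measures.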

While these technical measures can help deter and mitigate scraping activities, it’s critical to regularly monitor your website’s traffic and activity for any signs of unauthorized access or scraping. Additionally, staying informed about emerging scraping techniques and continuously updating your defenses can help maintain the security and integrity of your website’s content.
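One simple form of monitoring is to count requests per visitor in your server’s access log and look at the heaviest users. The sketch below assumes each log line begins with the client’s IP address, as in the widely used Common Log Format; your server’s format may differ. The `top_clients` name and the threshold are hypothetical and would need tuning to your own traffic.

```python
from collections import Counter


def top_clients(log_lines, threshold):
    """Return (ip, request_count) pairs for clients at or above the threshold.

    Assumes the client IP is the first whitespace-separated field on each line.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    # most_common() orders clients from busiest to least busy.
    return [(ip, n) for ip, n in counts.most_common() if n >= threshold]
```

A high count is a reason to look closer, not proof of scraping. Search engine crawlers and busy office networks can generate heavy traffic too.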

This is not an easy problem to solve. I think I’ve given you a fairly exhaustive list of legal and technical mechanisms to keep AI technologies from stealing your content. But as I’ve repeatedly warned, nothing is foolproof. So, the question you have to honestly answer is, “Are you sure that you would rather try to beat ’em than join ’em?”