The arguments in favor of and against LLMs claiming fair use.
by: Frank Bilotto
Jackie O(h My!)
Thirteen years before the enactment of the 1976 Copyright Act, photographer Stanley Tretick took a famous photograph of Jacqueline Kennedy for Newsweek magazine, which published it alongside a story on Jackie Kennedy. The image was widely circulated by various media outlets and became iconic, particularly after President Kennedy’s assassination.
The New York Review of Books decided to use the same photograph for a cover story about the Kennedy administration without first seeking permission. Newsweek attempted to stop the unauthorized use of the photograph, arguing that it held exclusive rights to the image and that the reuse violated its copyright.
The New York Review of Books defended its use of the photograph as fair use, because it used the image in a transformative way: to comment on the public’s perception of Jackie Kennedy and her role as First Lady, rather than to directly exploit it for commercial gain. The argument was that the image had become so culturally significant that using it for commentary on the political climate was an act of journalistic critique, which falls under the fair use doctrine.
Ultimately, the legal case was never fully adjudicated, but it remains an example of how copyrighted works were sometimes used without permission in ways that could later be deemed fair use under modern copyright law.
At the time, the concept of fair use wasn’t as clearly defined as it would later become under the 1976 Copyright Act, but this case set the stage for broader discussions about the balance between protecting an author’s rights and allowing certain uses of works for purposes like commentary, education, and news reporting. Although this case was more about unauthorized use and less about the legal complexities we now associate with fair use, it closely resembles the scraping practices of today’s AI technologies and highlights the evolving tension between copyright protection and the need for public discourse and commentary.
The fair use claim made by The New York Review of Books is not unlike the 2023 statement by Brad Smith, President of Microsoft, who said, “The content that’s publicly available on the internet is available to be used [to train large language models].” With one bold statement, Smith essentially declared that fair (and unlimited) use is no longer relegated to being an affirmative defense, but is effectively an inalienable right granted by God. Almost two years later, we’re no closer to resolving the issue. In fact, while you’re reading this article, thousands of pages of content will be scraped for use by large language models (“LLMs”).
So, What Is Fair Use, Anyway?
Fair use is a legal doctrine in copyright law that allows limited use of copyrighted material without requiring permission from the copyright holder. Fair use is meant to balance the rights of the copyright owner with the public’s interest in using that material for things like criticism, commentary, news reporting, teaching, scholarship, or research.
However, let’s not play semantic games. The number one goal of the most popular LLMs, like ChatGPT and Gemini, is money, by any means necessary. If the path to that money includes some criticism, commentary, news reporting, teaching, scholarship, and research along the way, so be it. I stand firm in my position that LLMs are purely commercial enterprises with the intent of making substantial profits and eliminating competition. Their claim of fair use must rest on other grounds.
The Early Days (Pre-1976)
Before the 1976 Copyright Act, the principle of fair use was developed through common law. For you non-lawyers, common law is a legal system based on judicial decisions and precedents rather than written laws or statutes. It originated in England and is used in many countries, including the United States. Basically, it’s a system where the outcomes of past cases shape how similar cases will be decided in the future. Over time, this produces a body of law that evolves through court decisions: a judicially created doctrine.
Fair use was a defense against copyright infringement that emerged from court decisions over time. Think of it as a set of principles judges used to determine whether using copyrighted material without permission was acceptable. Courts aimed to balance the rights of copyright holders with the public interest in accessing and using creative works. There wasn’t a formal list of factors; judges relied on their discretion and common sense to determine whether a use was fair.
In 1841, Judge Joseph Story introduced the concept of fair use in the case of Folsom v. Marsh. Folsom, the publisher of Jared Sparks’ multi-volume edition of George Washington’s writings, sued over a two-volume biography of Washington that copied 353 pages of Sparks’ work. Story ruled that the copying was not fair use. Despite the fact that the infringing work was educational and non-commercial, the court found that the extent of the copying was too large and that it took the “heart” of Sparks’ work, therefore denying a fair use defense.
If taking the heart of another author’s work exceeds the threshold for fair use, then surely the wholesale scraping of entire pages of content by LLMs should be unlawful.
Folsom v. Marsh established the foundations for the fair use factors that are still used today, particularly the importance of the purpose of the use, the amount of work used, and the impact on the market. These factors guided judges in subsequent cases for the next 135 years. One of the primary motivations behind fair use was to prevent copyright from being used as a tool for censorship. Courts were wary of giving copyright holders too much control over how their works could be used. They believed that a strict interpretation of copyright could stifle creativity and free expression.
So, where does that put LLMs? Remember, unlike the exact copying and publication of the Jackie Kennedy photo or the 353 pages of Sparks’ edition of Washington’s writings, LLMs use the content they scrape to create derivative works. A reasonable argument is that LLMs simply accelerate, by orders of magnitude, what a human could do: absorb all of the content on the websites they scrape and create derivative works from that knowledge.
Hold that thought. Let’s get back to the timeline of fair use.
The 1976 Copyright Act
Common law was finally codified in the Copyright Act of 1976. Since then, fair use has been determined by statute and interpreted by the courts.
Four factors determine whether unauthorized use of copyrighted material is fair.
- The purpose and character of the use.
Non-commercial, educational, or critical uses, such as commentary, criticism, or news reporting, are more likely to be considered fair use. If the use is transformative, meaning it adds something new or changes the original work in some way, it’s also more likely to be deemed fair use.
- The nature of the copyrighted work.
Factual works, like news reports, scientific articles, or historical documents, are more likely to be subject to fair use than creative works, like novels, plays, or artwork. This is simply because you can’t copyright a fact. So, factual works are generally considered to have less protection under copyright law.
- The amount and substantiality of the portion used.
Using a small amount of a work is more likely to be fair use, but if the portion used is considered the “heart” of the work, even if it’s a small part, it may weigh against fair use. The amount and significance of what is used play a key role in the decision.
- The effect of the use on the market or value of the copyrighted work.
Does the use of the work harm the market for the original work or its potential value? If the use competes with the original work and could reduce its sales or licensing opportunities, it’s less likely to be considered fair use. On the other hand, if the use doesn’t affect the market (for example, because it’s non-commercial or doesn’t substitute for the original), it’s more likely to be seen as fair use.
These four factors are weighed together, and no single factor is determinative. The courts will consider all of them in the context of each specific case.
The Betamax Case
In Sony Corp. of America v. Universal City Studios, decided in 1984, Universal and other movie studios sued Sony, the maker of the Betamax video cassette recorder (VCR), claiming that Sony’s VCRs facilitated copyright infringement. The movie studios argued that consumers were using the VCR to record TV shows and movies without permission from the copyright holders, thus violating copyright law.
Sony, on the other hand, argued that the VCR could be used for legitimate purposes, such as time-shifting (recording television programs to watch them at a later time) and that this was a fair use of the copyrighted content, not an infringement. The U.S. Supreme Court ruled in favor of Sony, holding that the use of VCRs to make home recordings for personal, non-commercial purposes was fair use.
I use this case for contrast because, like Betamax owners, LLMs are copying entire works to be used at a later time, but LLMs are not doing the copying for personal, non-commercial purposes.
The Betamax case is definitely foreshadowing for the future of LLMs. Videotape was invented in 1954. For 20 years it was used almost exclusively by the television studios for commercial purposes. In 1975, Sony launched the Betamax for $2,300. In 1985, I bought a VCR for $199.
Why am I using this example? Because I am convinced that in the very near future, I will have an LLM on my cell phone. And the more I think about it, personal LLM apps can’t come soon enough. When that becomes a reality, personal LLMs will actually benefit publishers. You have long heard me claim that in the near future an LLM will become the largest publisher in the world. As ChatGPT and Gemini continue to pull users away from traditional publishers, the revenue to publishers will shrink. As such, publishers won’t have the resources to maintain a full staff, and the amount of content they produce will decrease. At the same time, revenues to LLM technologies will increase, affording them the ability to create their own content. And content production will cost LLMs far less: nothing more than bullet-point notes is needed for an LLM to create an article, a book, or even a movie.
However, if I have an LLM on my phone, I am going to need content that I do not have the capacity to create; I’ll need publishers to create it for my personal use. But as I just wrote, if personal LLMs aren’t developed more quickly than personal video recorders were, I’m afraid the publishers providing the content for my LLM will be the current LLMs themselves.
Copyright Infringement or Parody
Campbell v. Acuff-Rose Music, Inc. was a landmark 1994 Supreme Court case that clarified the application of fair use to parodies. There is only one problem. The Court didn’t understand popular music of the day. 2 Live Crew’s version wasn’t a parody. It was a hard-hitting rap song.
Although never admitted, I am certain that 2 Live Crew’s lawyers were seeking a legal theory to win the case, and they guessed right. They convinced the Court that the song was a parody, similar to Weird Al Yankovic’s “Eat It,” a parody of Michael Jackson’s “Beat It.” Claiming that 2 Live Crew’s “Pretty Woman” was a parody was a virtual guarantee that the Court would decide in their favor.
However, who really won the case is a bit nuanced. 2 Live Crew won in principle, a significant victory for artistic freedom in general, but Acuff-Rose Music also got something. By settling, it secured licensing fees for the use of its song, ensuring it would be compensated for the parody. In the end, it was a partial victory for both sides, except for one important factor that distinguishes it from the current practices of LLMs: 2 Live Crew had to pay to use the original work. For me, this case highlights the complexities of copyright law and the balancing act between protecting creators’ rights and promoting creative expression.
Does this mean that LLMs can legitimately claim fair use but still have to pay licensing fees to publishers? CLI has compiled a list of the licensing deals between publishers and LLMs, and it’s growing every week. Again, I say this is a slippery slope to extinction for publishers. Remember Brando in On the Waterfront: “Kid, this ain’t your night. We’re going for the price on Wilson.” Publishers might be making the “short-end money” now, but publishers are unwittingly taking dives.
There’s no question that big publishers are snapping up the big paychecks in exchange for their content. Mid-size and smaller publishers are being left out. From day one, I have been a proponent of small and midsize publishers organizing as a collective to do one of two things. Either be bold and license LLM technology to create their own vertical search platforms, or at a minimum create bargaining power to license their content to LLMs at fair market rates.
AI and Fair Use
The New York Times is suing OpenAI, the maker of Sam Altman’s ChatGPT, for scraping its content. Altman hasn’t denied the NYT’s allegations. In fact, he calmly admitted that copyrighted works have been used for AI model training, without consent or an offer of compensation. Altman speaks as if fair use is a right, not a defense.
Altman was even confident enough to say, “The New York Times is on the wrong side of this suit.” The truth is that, given the way fair use has been adjudicated over the past 180 years, I’m not convinced that Sam Altman is wrong. Look, there is no question that I am 100% against LLMs ingesting content that they did not pay for, but under the Copyright Act of 1976, I am not convinced that OpenAI is in violation of the statute.
As an IP lawyer, I’d like to offer my own analysis of whether the operation of OpenAI is fair use, using the Copyright Act’s four-factor test. Keep in mind that fair use is determined by a balancing test, where no single factor is decisive.
- The purpose and character of the use.
Unlike Jackie Kennedy’s photograph, which was used in its original form by The New York Review of Books, ChatGPT generates text in response to user queries, which is typically transformative in nature because it does not simply reproduce the training data. Rather, it generates new, original text based on the patterns and knowledge learned from a wide variety of sources, producing works that do not directly copy any particular work.
The character of the use can also depend on whether ChatGPT is used commercially or for non-commercial purposes. For example, if ChatGPT is used to provide customer service or for business applications, this could weigh against fair use, as it’s a commercial use. On the other hand, if it’s used for educational, research, or non-profit purposes, it might be more likely to favor fair use.
- The nature of the copyrighted work.
Much of the content that ChatGPT is trained on is likely to include both factual works (such as news articles, technical documents, etc.) and creative works (like books, poems, and films). Courts tend to favor fair use for factual content because it contributes to the public interest and knowledge. However, for creative works (like novels or songs), the fair use argument might be weaker since they are afforded more protection.
The fact that much of the training data may involve creative works means this factor might weigh against fair use, especially when ChatGPT generates text that resembles or references a specific copyrighted creative work.
- The amount and substantiality of the portion used.
This factor could go either way. ChatGPT uses all of the work, including the “heart” of each article. But unlike the defendant in Folsom, ChatGPT doesn’t publish what it uses. So, in effect, ChatGPT publishes none of the content it scrapes from the New York Times. And if there is a direct, verbatim copy of language in the derivative works it creates, it is likely to be very small in amount and not substantial.
Remember, ChatGPT generates original content based on its training, meaning it does not directly quote or reproduce substantial portions of copyrighted material. In most cases, the output is either brief or reworded, so this factor tends to favor fair use.
If ChatGPT were to generate text that closely mirrors the most iconic or central part of a copyrighted work, such as a signature line from a book or song, this factor could weigh against fair use. However, in practice, ChatGPT’s responses tend to provide general information or summaries, which would not likely qualify as taking the “heart” of a copyrighted work.
- The Effect of the Use on the Market for the Original Work.
This is the potential killer for ChatGPT. For two years, I’ve roared that ChatGPT harms the market for publishers like the New York Times, including their potential value and future marketability. No matter what ChatGPT alleges about its presence in the market, I boldly declare that ChatGPT directly competes with the New York Times and every other publisher from which it takes content.
Whether the use of a copyrighted work harms the market for the original is one of the key considerations in the fourth factor of determining fair use. While ChatGPT does not directly distribute or sell the original copyrighted works it has been trained on, its generated responses absolutely influence the demand for those works.
If a user asks ChatGPT for a specific quote, summary, or answer based on content from copyrighted works (e.g., asking for a summary of a book or information from a copyrighted article), and ChatGPT provides that information without users needing to access the original work, this reduces the need for users to seek out the original work themselves.
In this sense, ChatGPT could be seen as substituting for the original content, competing with it directly, and reducing the market for the original works, especially if users trust its answers and don’t verify or look up the original sources.
In the Early Days, You Didn’t Trust Google.
The point that users may trust ChatGPT’s outputs without verifying them is crucial. If users view ChatGPT as an authoritative source, which, to some extent, many do, they will be less likely to go back to the original work. This reduces the incentive to purchase, read, or access the original works, which will have a negative impact on the market for those works.
This is especially relevant for creative works, like novels, music, or movies, where the content may not only be used for informational purposes but also for enjoyment or consumption in its entirety. If ChatGPT provides a plot summary or generates text similar to a copyrighted song or poem, users might feel they don’t need to explore the original work, which will harm the copyright holder’s revenue.
Direct competition will be more pronounced in cases where the generated content substitutes for a part of the original work. If ChatGPT produces answers, summaries, or information that users would otherwise get from books, articles, or other copyrighted sources, it reduces the demand for those original sources. It didn’t take very long for Google to be “good enough”. The same has already happened with ChatGPT.
For example, if someone asks for an explanation of a novel’s plot or a historical event and ChatGPT provides a detailed response, the user might not feel the need to buy the book or research the original event further. While ChatGPT claims its purpose isn’t to compete with the original works, the convenience of receiving immediate, generated content absolutely impacts the market for those works.
If my argument holds true, it could indeed weigh against fair use, as market harm is a significant factor. Courts have traditionally considered market harm in terms of how the new use impacts the commercial market for the original work. For instance, if an AI model like ChatGPT reduces the need for users to purchase or interact with the original works, courts may be more likely to view that as harmful to the market for those works. This could weigh against the fair use argument, even if the other factors favor fair use.
My position hasn’t changed. ChatGPT directly competes with original works, especially if its generated content is trusted by users without verification. In such cases, this results in market harm, which should weigh against the fair use argument under the fourth factor of the fair use test.
So, while the other three factors might still lean in favor of fair use (the purpose and character of the use, the nature of the works, and the amount used), the market-effect factor is a critical consideration. Courts will likely need to weigh whether ChatGPT’s operations diminish demand for the original works, particularly when users rely on it as a substitute for accessing those works directly.
This is a complex issue that would likely require careful examination by courts to determine whether the market harm caused by AI-generated content justifies a fair use claim.
Fair Use? Maybe, but it’s still wrong.
About a month ago, Paul Gerbino came up with the brilliant question, “When did fair use cease being an affirmative defense and become an inalienable right?” I jumped at the idea of proving, by examining the history of fair use, that LLMs are in violation of the law. When I began to write this article, I was convinced that I would effectively articulate that the scraping of content by LLMs is not fair use. If you’re in doubt, revisit the first half of this article. However, my objective jurisprudence and what I consider a thoughtful application of the four-factor test to the operational model of LLMs leave me questioning whether the courts will find that LLMs’ practices are not fair use.
But that doesn’t mean that what LLMs do is equitable. As an objective broker of morality and honesty (which not all would agree that I am), I know it is wrong to take the property of another without compensation. Maybe what LLMs do is fair use, and therefore legal. But we all know that it is wrong, plain and simple.
The current state of affairs will, without question, lead to the demise of many current publishers and completely change the face of content creation as we know it. Some may argue that this is simply a technological evolution, not unlike what has happened repeatedly over the history of mankind. But Henry Ford didn’t steal horseshoes from blacksmiths to build cars. The essential component to make LLMs valuable is being produced by other creators. And if something doesn’t change quickly, creators of that essential component are going to be eliminated and replaced. Publishers will become the blacksmiths of the 21st century. And the way it will happen is simply wrong.
There’s only one way to fight back. Publishers need to unite and take control of the technology instead of enabling the technology to take control of publishing. I’ve been screaming about it for more than a year. Unfortunately, no one is listening.
Frank Bilotto is a licensed attorney with over 25 years of experience in commercializing intellectual property. He was instrumental in creating The World Reporter in 1999, an alliance of 10,000 daily newspapers and the first such content alliance in the digital content space. He’s negotiated more than 1,000 intellectual property licenses with the world’s largest organizations, including Comcast, Google, BBC, NewsCorp, Gannett, ESPN, NBC, CBS, AT&T, Dow Jones, Thomson Reuters, Facebook, Microsoft, Nike, Adidas, Hewlett Packard, Knight Ridder, Capitol Records, MGM and Paramount. Frank’s passion outside of content licensing is trying to love his neighbor as himself. (Unfortunately, he fails too often.)