If a tree falls in the woods…
For the past three years, I have been complaining, warning and predicting what AI was going to do to traditional publishers around the globe. BTW: In his article for CLI’s next issue, Paul Gerbino will cite video evidence that I was way ahead of the curve. However, unlike most doomsayers, I’ve been providing the prescription of exactly what publishers should have done and should be doing right now to survive, and in fact thrive, in the Jetsons’ version of content creation.
“There will never be an AI-generated article in any of our publications.” – Neil Vogel, 2021
I wonder if Dotdash Meredith’s CEO, Neil Vogel, still believes that an AI-generated article will never appear in any of the company’s publications. BTW: I’m the one who said, when I read that quote, that if Dotdash Meredith didn’t start using generative AI to write articles, some yet-to-be-determined LLM would challenge for the title of largest publisher in the world.
Initially, I thought it would take 10 years for AI to take over publishing. Now I’m cutting that prediction in half, and that’s largely because Dotdash Meredith gave away the company’s assets for roughly $16 million per year. I know that’s the fixed component of the deal, but unless the variable component can approach the value of the company, which is north of $4 billion, Dotdash Meredith got slickered. (As a kid, I liked The Beverly Hillbillies and love to use the term “slickered” whenever I can.)
BTW: It’s a shame Mr. Vogel didn’t read my articles or call me, because I certainly could have negotiated a significantly more favorable deal, one that more closely matched the value of the company, because the company’s value is exactly what Dotdash Meredith sold to OpenAI. What Mr. Vogel probably still doesn’t realize is that OpenAI isn’t interested in licensing his content in perpetuity. OpenAI will license the content just long enough until it can generate its own articles at significantly less cost, simply by hiring intern-level “wannabe” journalists (and nobody else) to be the eyes and ears of ChatGPT. The technology will do the rest.
The story is the same for Robert Thomson, CEO of News Corp. Without question, this man is smarter and significantly more successful than I’ll ever be. But he sold a $16 billion company for $250 million, spread over five years, to guess who? OpenAI. And there is no way that OpenAI intends to renew that agreement. It makes me wonder whether Mr. Thomson understands the power of ChatGPT, or whether he’s just trying to improve News Corp’s financials for the next quarter, knowing he’ll be enjoying his retirement in a tropical paradise when News Corp files for Chapter 11, victimized by shrunken revenues and no longer able to afford the staff to produce content for its publications. I find it hard to believe that he doesn’t know that ChatGPT’s LLM will be learning everything it needs to know about the financial markets from the archives of News Corp’s flagship publication, The Wall Street Journal. Soon, OpenAI will be hiring eyes and ears to report on Wall Street, and presto, ChatGPT will be covering Wall Street with in-depth reporting equal to or better than that of the best journalists at the Journal.
News Corp’s decision is even more surprising, because Rupert Murdoch was one of the very few publishing moguls who never bought into the idea of putting his content on the internet in exchange for the promise of advertising dollars. The Wall Street Journal’s content remained behind a paywall. When the advertising dollars never materialized and other major newspapers were desperately attempting to put the genie back in the bottle, The Wall Street Journal never missed a beat. Best estimates are that the Journal’s annual revenues exceed $2 billion. But News Corp risked it all for $50 million per year over five years. If Fred Sanford could talk to Thomson, he’d simply say, “You Big Dummy.”
But Wait…There’s More
So, I’m opinionated, certain of my conclusions, and not afraid to state them publicly. I thought I had exhausted the subject and had nothing left to say about how AI is impacting publishers. Then I received a call from halfway around the world that caught me unprepared. It seems there is a category of publishing that I completely overlooked, one that may feel the impact of AI more than any other. A book publisher wanted my opinion, and I’m embarrassed to say, I had nothing for her.
In the beginning…
Let’s get back to where the problem originated.
In its manifesto, Google proclaimed that it would organize the world’s information. Part of that mission was the Google Books Project, launched in 2004. The goal was to digitize books from libraries around the world, including those at Harvard, the University of Michigan, Stanford and Oxford, and make them searchable online.
The project was announced at a time when Google was issuing weekly press releases about new projects in lieu of a traditional ad campaign. Most of those projects never got started, let alone completed. But I must say the strategy worked: Google was always in the news. At the time, I said Google would be remembered not as a great technology company but as a great marketing company. Be that as it may, I was convinced that the Books Project was just another attention-grabbing media strategy. As most of you know, I was dead wrong.
As of the latest available data, Google has digitized over 40 million books through the project. This includes books from a wide range of genres and fields, and the collection is constantly growing, as new books are added.
So, What’s the Problem?
The problem for book publishers is that, unlike newspaper, journal and magazine articles, books have a long shelf life. I just read “To Kill a Mockingbird” for the first time last week (I can hear you snickering). Until then, I relied on Gregory Peck and those two great child actors to tell me the story. The point is that the book was written in the year I was born. Besides the election of John F. Kennedy, the Greensboro lunch counter sit-ins, and Bill Mazeroski’s walk-off homer in Game 7 of the World Series, I doubt that many of the thousands, if not millions, of articles written that year have any value today. However, “To Kill a Mockingbird” still sells about one million copies per year.
Did I make my point? Book publishers have a lot more to lose than daily, weekly or monthly publications do.
The Trojan Horse Strategy
Google argues that its goal has been to make books more accessible by creating a searchable database containing all of them. An important statistic that I learned from ChatGPT, and verified with two original sources, is that fewer than 8 million books in the collection are in the public domain. (When a book’s copyright expires, it enters the public domain.) Those books are fully accessible on Google Books: users can read them in their entirety, download them in various formats (like PDF or ePub), and in most cases even print them.
However, that means more than 30 million books in Google Books are still under copyright protection. Google searches return only limited previews or snippets, which Google characterizes as fair use. Users can search for specific terms within these books, and if the book is available for purchase or through a library, they may be directed to a purchasing platform or library partner.
Additionally, Google maintains its own digital library of the scanned books. This library is where users can browse, preview, or search for books. The library interface allows users to explore different categories, view recommended titles, and access books based on specific topics or genres. (Spoiler alert – LLMs can do the same thing in a Google search box as a human can.)
Google allowed publishers of books under copyright protection to set the extent of what it could show. In some cases, only a few pages or snippets are shown, while in others, users might see entire chapters or sections, depending on what the publisher has allowed.
Echoing the promise it initially made to website publishers in exchange for allowing their sites to be indexed, Google promised book publishers increased sales. So millions of publishers and authors contributed their books to the Google Books project. By doing so, their books could be made available for preview, discovery and, in some cases, sale through Google Play Books.
The Google Books project has faced legal challenges, especially around copyright and fair use. One of the major cases was the Google Books Settlement (2008-2015), where publishers and authors contested Google’s practice of digitizing books without express permission. The settlement was eventually modified and impacted how Google would work with copyrighted books.
Despite these challenges, Google continues to expand its library of digitized works, and while it is without question a groundbreaking initiative for democratizing access to knowledge, it’s not fair to the creators of the content.
Google defended its actions by arguing that its digitization project fell under the doctrine of fair use. Fair use allows for the limited use of copyrighted material without permission in certain circumstances, such as for purposes of criticism, commentary, research, or education. Google argued that its use of copyrighted material for the purpose of indexing and searching was transformative and aligned with fair use principles.
The Google Books Settlement case ended when Google proposed a settlement with the plaintiffs, which would have allowed Google to continue digitizing and offering books online while compensating authors and publishers. Under this proposed agreement, Google would create a Book Rights Registry, a system to manage the distribution of revenues to copyright holders based on their books’ usage.
The proposed settlement was highly controversial. Many authors, particularly independent authors, argued that it disproportionately favored Google and large publishers while failing to provide fair compensation to smaller authors. Although the settlement aimed to address these issues, it was never fully implemented, and Google pivoted to a new model in which it collaborates with copyright holders through the Partner Program.
So, what does all of this have to do with AI?
Everything!
When Google was scanning 40 million books, Bard and later Gemini weren’t a consideration. The cases and controversies were all focused on what Google was permitted to show in its search results.
But now the books that Google has digitized through its Google Books project are in play. They are not directly part of Google’s Gemini LLM, but here is a quote directly from Google: “There is a relationship between the two technologies in terms of data use and potential overlap in the broader ecosystem of Google’s artificial intelligence (AI) and machine learning efforts.” So, the books are not directly integrated into Gemini, but they could indirectly inform other aspects of Google’s AI systems.
Gemini is Google’s next-generation AI model designed to power many of its AI-driven services. Gemini is a suite of models developed to handle tasks such as natural language understanding, generation, reasoning, and multimodal capabilities, including text, image, and video inputs.
Gemini and other LLMs are trained on massive amounts of textual data from a wide range of sources, which can include web pages, research papers, articles, books and other publicly available content. However, the specific datasets used to train Gemini have not been fully disclosed by Google. It is likely that portions of publicly available books, including the public-domain works Google has digitized, are part of the broader text corpora used to train such models, but Google does not specifically disclose whether Google Books content is directly used in Gemini’s training.
The digitized books in Google Books, especially those in the public domain, could be used as part of the broader data set from which AI models might learn language patterns, context, and knowledge. However, access to these books is managed via Google Books, with copyright protections in place for non-public domain works.
What is upside down in this controversy is that the burden should be on Google to ensure that the data used for Gemini’s training does not violate copyright laws. Books that are under copyright protection should not be used to train its LLM, unless Google can verify that it has permission from the copyright holders, either through licensing agreements or publicly available datasets. This is all a mystery that Agatha Christie could never solve.
While there is a connection between the vast amount of data available from Google Books and the training of AI models like Gemini, it is not clear if specific copyrighted books are directly used for training the Gemini LLM.
Intermission
I stopped writing for the past two hours because I’ve been asking Gemini complex questions about the plots and characters of books with which I am intimately familiar, with the intent of determining whether Gemini’s answers come from the full text of the books being stored in the LLM itself or whether Gemini uses retrieval-augmented generation (RAG) to access Google Books. Candidly, after 50 queries, I have no idea.
After losing critical time (the article is due at CLI headquarters in four hours), I realize now that it doesn’t even matter. It is irrelevant whether the digitized books are stored in the LLM or whether Gemini uses RAG to access the books sitting in Google Books through the Google search engine. Either way, it serves the same end. And once the content is accessed, the machine “learns” it. It can’t be deleted, and it remains part of Gemini for future reference for as long as Gemini exists.
Show Me the Money
One of the main concerns for book publishers and authors is the use of copyrighted works to train AI systems, including large language models like Gemini. In many cases, these AI models are trained on vast amounts of publicly available text data, but it is unclear whether that data was legally sourced or whether publishers and authors were compensated.
Book publishers have expressed concern about how their copyrighted content might be used to train AI models without proper compensation. If Google’s Gemini model, for example, was trained on books or parts of books that are copyrighted, publishers might argue that this constitutes unauthorized use of their works, similar to how Google faced lawsuits for digitizing books without permission through the Google Books project.
It’s Coming
As of 2023 and 2024, there have been several claims that OpenAI might have violated copyright laws by using copyrighted texts to train its models. Some authors and publishers have expressed concerns that LLMs were trained on their works without consent, and in some cases, without appropriate compensation. This could lead to legal challenges involving whether the use of text for training purposes is permissible under fair use.
Both Bertelsmann (through Penguin Random House) and Hachette have expressed concerns over how AI models and Large Language Models (LLMs), such as ChatGPT, use copyrighted books, articles, and other texts for training purposes. The concern is that these AI systems might use large swaths of copyrighted material without licensing or compensation for the authors and publishers who hold the copyright. This concern could potentially lead to lawsuits against companies creating and using these models.
As the AI/LLM field expands, there has been increasing scrutiny over whether text-based models violate copyright laws by training on content without permission from authors or publishers. Given that Penguin Random House and Hachette control large swaths of copyrighted works, their concerns center on whether those works are being used improperly to train LLMs.
The elephant in the room is that there is no transparency regarding which data was used to train LLMs. If copyrighted works were included in the training sets without consent, this could be a violation of copyright laws.
Some companies may argue that the models are using publicly available data to train without the need to get permission, but as Paul Gerbino often asserts, “Just because content is accessible on the web doesn’t mean it’s free to use for commercial purposes.”
Solutions
The short-term solution is simple: money. LLM developers need to pay publishers and authors for using their content to train their models.
The long-term solution is more complex. We need to reform copyright laws to address the issues surrounding AI and machine learning. It’s not news for me to say that copyright laws are outdated and need to be updated to handle the complexities of AI training and the usage of copyrighted material. But don’t hold your breath.
It’s 4:30 am. No pithy conclusion. I’ve made my point. I’m going to bed.
But don’t forget: it doesn’t matter whether LLMs store the books or use RAG to create answers. Publishers and authors will feel the negative effects if they’re not properly compensated. As Walter Cronkite said every night, “And that’s the way it is.”
OK, one pithy comment:
“A room without books is like a body without a soul.” – Marcus Tullius Cicero.
Frank Bilotto is a licensed attorney with over 25 years of experience in commercializing intellectual property. He was instrumental in creating The World Reporter in 1999, an alliance of 10,000 daily newspapers, and the first such content alliance in the digital content space. He’s negotiated more than 1,000 intellectual property licenses with the world’s largest organizations, including Comcast, Google, BBC, NewsCorp, Gannett, ESPN, NBC, CBS, ATT, Dow Jones, Thomson Reuters, Facebook, Microsoft, Nike, Adidas, Hewlett Packard, Knight Ridder, Capitol Records, MGM and Paramount. Frank’s passion outside of content licensing is trying to love his neighbor as himself.