Grounded for Good: The Case for News Data Licensing in Generative AI Grounding Layers

by Jay Krall

Jay Krall discusses the legal and regulatory challenges surrounding the use of copyrighted works in AI training. He covers the hurdles copyright-infringement lawsuits are facing in court, and how the EU may be watering down its AI regulations under pressure from the tech industry. But there is hope. Despite the challenges, Jay suggests that this technology revolution doesn’t have to follow the pattern of the search engine and social media revolutions the media has already experienced.

When a commercial AI product learns from copyrighted works, do its builders owe royalties to rights holders? The raft of lawsuits against US technology companies over this issue is beginning to face legal headwinds, with courts and regulators rejecting infringement claims. But publishers may be focusing their enforcement and monetization efforts in the wrong areas of AI development. 

Recently, Judge Vince Chhabria of the United States District Court for the Northern District of California batted down key assertions in memoir author and comedian Sarah Silverman’s suit against Meta Platforms Inc. “There is no way to understand the LLaMA models themselves as a recasting or adaptation of any of the plaintiffs’ books,” Chhabria wrote, referring to the Facebook parent company’s generative AI project. Copyright infringement relies on actual similarity to a copyrighted work, he wrote. For AI foundation models like OpenAI’s ChatGPT, trained on datasets of trillions of words and able to respond to complex user prompts, the use of copyrighted material in training alone, without substantial similarity between the owner’s work and the AI’s output, doesn’t constitute infringement, according to his ruling.

In addition to adverse outcomes in US courts, copyright holders are also witnessing the watering down of the EU AI Act’s draft rules requiring disclosure of copyrighted works used in AI, as the legislation nears its final version under immense pressure from the tech industry. The AI Act is now likely to require only that AI builders obey the existing EU Copyright Directive, without giving publishers a way to find out when their content is used in AI training.

We have all seen this movie before, right? A disruptive, exciting and scary new technology is introduced. Governments compete to attract investment with incentives and low regulation. Like social media, AI regulation faces what American media scholar Ethan Zuckerman called the “cute cat problem”: governments face electoral repercussions when they restrict popular apps. So far, there are no signs that publishers will get much help from regulators around content usage rights in AI.

Nonetheless, there’s no reason to assume this technology revolution needs to follow the pattern established by the search engine and social media revolutions, where news publishers failed to capture a significant share of the economic growth driven by innovations. Breaking the pattern requires new partnership strategies, and a new mindset about the role of high-quality newsgathering in the technology ecosystem. 

When billions of consumers adopted Google and Facebook products, businesses had to adapt their marketing strategies. But compared to businesses, consumers have low requirements for information quality. In particular, Facebook’s ability to dissolve its news partnerships and simply shift audiences to individual content creators demonstrated the limits of modern consumer interest in news. Businesses are very different in this regard, and unlike the search and social waves, generative AI is transforming businesses internally, not only in response to external consumer behavior.

Companies across every industry have long relied on their ability to license news as an input to their data systems and decision-making processes. In the media monitoring and business intelligence product spaces, where I’ve spent the past 17 years, organizations like LexisNexis, Dow Jones Factiva, and national copyright collectives like the Newspaper Licensing Agency in the UK collect revenues from software builders and redistribute them to publishers. Now administered primarily in the form of APIs and data feeds, these products and licensing structures have their roots in the “corporate library” archives of the 1980s, when most large companies maintained their own periodical reference collections for researching potential products, market conditions and competitors. Media monitoring is an estimated $3.5 billion global market.

That product space is quickly being swallowed by generative AI. As of 2023, nearly all major media monitoring and social listening applications have integrated ChatGPT or an equivalent. For the moment, these AI “assistants” or “copilots” summarize the news articles and social media posts served to them through news licensing collectives, though some licensors prohibit automated summarization in their usage terms. Compared to the painstaking task of reading and tagging content by hand, AI already offers such a compelling alternative that some monitoring firms have reduced the size of their analyst teams as they reap efficiency gains.

This quick adoption of generative AI in business analytics changes the value proposition of the software applications that have long served a wide variety of business research use cases. AI foundation models have “knowledge cutoffs”, which means they aren’t trained on recent events. AI models therefore need these tools for data access, which often extends beyond what’s available on the public Web, particularly at low latency. Bridging this gap, what was once your “media intelligence platform” may now become simply a backend data access point for generative AI, your application’s charts and content lists quickly displaced by a continuous chatlog-style narration of your media environment. No one really wanted to puzzle over those charts anyway.

To incorporate recent information into an AI response, applications built on foundation models also use what are known as “grounding layers”. According to a recent News Media Alliance white paper, “the output of large language models can be extended to encompass potentially up-to-the-minute information that was not included in their training sets by using real-time search results as context for their responses. This method, known as ‘grounding’, is employed by generative AI-based applications, such as Microsoft’s Bing Chat, OpenAI’s ChatGPT-Plus, Anthropic’s Claude-2, and Google’s Search Generative Experience.”
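The mechanics of grounding can be sketched in a few lines of Python. This is a minimal, illustrative mock, not any vendor’s actual implementation: the `NEWS_INDEX`, `retrieve`, and `build_grounded_prompt` names are invented for this example, retrieval here is naive keyword overlap rather than a real search engine, and a production system would query a licensed news feed and pass the assembled prompt to a language model.

```python
from dataclasses import dataclass

@dataclass
class Article:
    headline: str
    body: str

# Stand-in for a licensed real-time news feed (hypothetical data).
NEWS_INDEX = [
    Article("Chipmaker announces Q3 results", "Revenue rose on data-center demand."),
    Article("New EU AI rules near final vote", "Lawmakers debate disclosure of training data."),
    Article("Retailer expands grocery delivery", "Service now covers 40 new cities."),
]

def retrieve(query: str, index: list[Article], k: int = 2) -> list[Article]:
    """Rank articles by naive keyword overlap with the query (a real
    grounding layer would call a search or news API here)."""
    terms = set(query.lower().split())
    scored = sorted(
        index,
        key=lambda a: len(terms & set((a.headline + " " + a.body).lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_grounded_prompt(query: str, index: list[Article]) -> str:
    """Prepend retrieved articles as context ahead of the user's question,
    so the model answers from current news rather than stale training data."""
    context = "\n".join(f"- {a.headline}: {a.body}" for a in retrieve(query, index))
    return (
        "Answer using only the news context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )

prompt = build_grounded_prompt("What are the new EU AI rules?", NEWS_INDEX)
print(prompt)
```

The point of the sketch is the shape of the pipeline: fresh content is fetched at query time and injected into the prompt, which is why the grounding layer, rather than model training, is where licensed news data earns its keep.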

The components of a competent AI system go beyond its foundation model. In order to be useful to businesses making decisions in near real time, AI will need access to real-world “sensors”: that is, news organizations. It’s in the grounding layer, not the training of a foundation model, that news becomes a fundamental ingredient to the success of new forms of AI. From investment research to national defense and cybersecurity, no high-stakes technology stack operates today without news content to process as a data input in its decisions. AI won’t change that anytime soon, simply because there’s no better way to gather information about events as they happen, than working with major news organizations. 

As generative AI automates large swathes of data analysis work, publishers will be incentivized to protect their content with paywalls, and license it to conscientious technology builders who are willing to be transparent about their content usage. AI builders will win B2B markets on the quality of the information they process, while consumer-facing applications compete on style, persona development and ultimately, user intimacy. As ever, it’s business applications where news proves most essential as a data product. 

Generative AI Glossary

Foundation models – Generative AI models capable of general-purpose conversation and simple tasks. Also referred to as large language models (LLMs). 

Grounding layer – The component of a general-purpose AI system that gathers context from external data sources, which may be updated in real time, to generate better responses to user prompts than a foundation model alone can produce without external signals.

AI assistant – A custom implementation of a foundation model which gathers information from an external data source, such as a real-time news feed, to improve or update an AI response. 

About Jay Krall

As a Wall Street Journal food industry reporter, Jay gathered public opinion on foot, from people in grocery stores and restaurants. Since 2008, he has managed software products which help businesses understand public opinion from billions of social media discussions and news articles.

Jay led the development of audience data products for Facebook and Reddit, and delivered open-data analytics and custom AI projects for Fortune 100 clients. He has consulted on data strategy for firms specializing in investment banking, digital advertising and security. Most recently, he led data partnerships for London Stock Exchange-traded Access Intelligence Plc.