Why The New York Times' lawyers are inspecting OpenAI's code in a secretive room
- Lawyers for The New York Times are poring through ChatGPT's source code and training material.
- Copyright cases from publishers and authors are trying to figure out how AI trains on creative work.
- The lawsuits could chart a path forward, much as Napster's legal morass did two decades ago.
Somewhere in the United States, in a secure room, on a computer unconnected to the internet, sits the source code for ChatGPT.
It is there to be inspected by lawyers for The New York Times.
By the order of a federal judge, the lawyers can only get into the room if they show a government-issued ID to a security guard. They are forbidden from bringing in their own phones, flash drives, or any other electronic devices. They're given a computer, also unconnected to the internet, with a word processing program. After each session, their notes can be downloaded to a different computer, and then the original note-taking computer may be wiped.
The Times' lawyers can share their notes with up to five outside consultants to help them understand what the code does. If one of the lawyers wants to show OpenAI CEO Sam Altman a snippet of the code to ask him questions about it for a deposition, that copy will be destroyed afterward.
OpenAI is worth $157 billion largely because of the success of ChatGPT. But to build the chatbot, the company trained its models on vast quantities of text it didn't pay a penny for.
That text includes stories from The New York Times, articles from other publications, and an untold number of copyrighted books.
The examination of the code for ChatGPT, as well as for Microsoft's artificial intelligence models built using OpenAI's technology, is crucial for the copyright infringement lawsuits against the two companies.
Publishers and artists have filed about two dozen major copyright lawsuits against generative AI companies. They are out for blood, demanding a slice of the economic pie that made OpenAI the dominant player in the industry and which pushed Microsoft's valuation beyond $3 trillion. Judges deciding those cases may carve out the legal parameters for how large language models are trained in the US.
"Developers should pay for the valuable publisher content that is used to create and operate their products," a Times spokesperson told BI. "The future success of this technology need not come at the expense of journalistic institutions."
For the lawsuit, the 173-year-old media company employed an elite law firm, Susman Godfrey, which recently won Dominion's mammoth $787.5 million settlement from Fox News. Other lawsuits from newsrooms, including The New York Daily News and Mother Jones, have latched on to the case.
Susman Godfrey is also representing a group of authors including George RR Martin, Jodi Picoult, and Ta-Nehisi Coates, who filed copyright claims months before the Times. If a judge certifies their class-action status, an eventual settlement or judgment could have ramifications for virtually every author and artist whose work has been used to train AI models.
On September 12, dozens of lawyers from the tech and journalism companies packed into a magistrate judge's courtroom in lower Manhattan to figure out the best way to divvy up the discovery process, including inspecting ChatGPT's code and training data. Along with the authors' lawyers, they are still deciding who they can depose and how to schedule the depositions.
"It's as thrilling as things get for law professors who work in copyright," said Kristelia García, an intellectual property law professor at Georgetown University Law.
Setting the rules
With Congress taking a backseat on AI regulation, the industry expects courts to set — or, they hope, not set — the rules.
Many publishers, including Business Insider owner Axel Springer, have struck deals with generative AI companies to share their content for LLM training.
The scope and resources of the Times' lawsuit make it a likely candidate for a precedent-setting Supreme Court case. Lawyers are also looking at class action lawsuits from authors, as well as a music industry case against Anthropic, as ones to watch.
"The New York Times is a journalistic juggernaut," García said. "It's big, it has a lot of content. More importantly, perhaps, it has a lot of market power behind that content."
The lawsuit argues OpenAI infringed on its intellectual property in two ways.
There is the "input" case — alleging that the LLM illegally hoovered up over 10 million New York Times articles to train ChatGPT and Microsoft Copilot without compensation. And the "output" case — arguing that when asked, ChatGPT can spit out a New York Times article that readers would otherwise pay a subscription for.
In court filings, lawyers have repeatedly cited Napster, which illegally copied millions of songs and made them available for free. OpenAI similarly used high-quality, well-researched, well-written, and fact-based New York Times articles to make ChatGPT so impressive, the Times argues.
If anything, OpenAI is worse, according to Justin Nelson, a Susman Godfrey attorney representing authors in a class action lawsuit running parallel to the Times' case and in a similar case against Anthropic.
Napster was a project from college kids; OpenAI is backed by Microsoft and already worth billions.
"Instead of kids, it was a sophisticated company," Nelson told BI. "And instead of doing it for their own personal use, they were doing it for commercial gain."
Representatives for OpenAI and Microsoft didn't respond to requests for comment from Business Insider. In court, they argue the legal doctrine of "fair use" protects how their models ingest the articles. The ChatGPT outputs with near-verbatim copies of Times articles were "highly anomalous" results that aren't representative of how the app is used, they say.
Napster was sued out of existence, but it inspired the music industry to adopt MP3s and, eventually, streaming — now used for everything from video games to movies. Spotify cofounder Daniel Ek has cited Napster as an inspiration, and Napster cofounder Sean Parker has praised Spotify as a successor.
Copyright lawsuits from journalism organizations may set the pace for all AI generators, predicted García, who worked in the music industry for a decade. AI isn't particularly good at generating movies or doing reporting, but it can convincingly mimic journalism.
"Journalism is kind of the canary in the coal mine," García said. "In the same way that music was the canary back in the Napster days, because people could easily torrent an MP3. But you couldn't, at that time, easily torrent a film."
Given the sheer number of people involved, the authors' lawsuits could have an even more dramatic effect. A settlement or judgment could change business models.
"People get creative in class action settlements," said Matthew Sag, an Emory University law professor studying copyright law and artificial intelligence. "You could cut the authors of America in for a percentage of stock or something."
The source code
The nature of generative AI technology itself lies at the heart of the copyright disputes.
What actually happens when a large language model "learns" a book or news article? What about when ChatGPT digs through the model to answer a query? Does the process make a "copy" in any meaningful sense of the word? Or is the training data just part of a big slurry of ones and zeros that no longer meaningfully resemble specific works?
The lawyers and consultants poring through ChatGPT's code are trying to answer those questions. They are also examining the LLM training data and plan to ask key OpenAI executives and programmers — under oath — how the models are meant to work.
Once the code is read and depositions taken, the parties will be in a better position to argue about "fair use," a notoriously tricky legal doctrine that protects the use of "transformative" creations derived from copyrighted material.
If OpenAI really is making copies of books and news articles, Napster-style, then is its training process sufficiently transformative to be considered "fair use"? Judges across the country are "all over the map" in deciding fair use copyright cases, according to Christa Laser, an intellectual property law professor at Cleveland State University, setting up high and unpredictable stakes.
"I think that's going to be the big question at the end of the day that's going to go all the way up to the Supreme Court," Laser told BI. "That question of fair use around training data, ingesting and training."
A key "fair use" question is whether ChatGPT's creations compete with the original journalistic works — an urgent issue for news organizations.
"The news publishers are the first to bring these big suits because they have more on the line," García said.
To make a copyright claim, a plaintiff can't just point to a corpus of work used as inspiration. The plaintiff needs to identify a specific work that it says has been copied.
In its lawsuit, The New York Times attached tens of thousands of pages of exhibits tabulating 10,553,897 articles. It says OpenAI and Microsoft violated the copyrights for each of them.
Among those articles is a 2001 story, published shortly after an appellate court ruled against Napster, in which a journalist asked users what they'd do. They all agreed there was no going back.
"If Napster does shut down, there are more sites out there," one user told the reporter. "And they may get a few, but they can't stop all of them."