OpenAI Training Data to Be Inspected in Authors’ Copyright Cases

For the first time, OpenAI will provide access to its training data for review of whether copyrighted works were used to power its technology.

In a Tuesday filing, OpenAI and the authors suing the Sam Altman-led firm indicated that they had come to terms on protocols for inspecting the information. The authors will seek details on the incorporation of their works into training datasets, which could become a battleground in a case that may help establish guardrails for the creation of automated chatbots.

The agreement stems from a trio of lawsuits initiated by top authors, including Sarah Silverman, Paul Tremblay and Ta-Nehisi Coates, accusing OpenAI of harvesting mass quantities of books across the web, which were then allegedly used to produce infringing answers by ChatGPT. It comes after the court in July dismissed a claim alleging that the company engaged in unfair business practices by utilizing their works without consent or compensation. Previously, U.S. District Judge Araceli Martínez-Olguín also tossed other claims for negligence, unjust enrichment and vicarious copyright infringement, though the writers’ claim for direct copyright infringement remained untouched.

In other cases, AI companies have denied wholesale copying of works. Rather, they’ve argued that training their models involves developing parameters based on those works to define what things look like and how they should be constructed. OpenAI may advance that defense at a later stage of the authors’ case, as well as arguments that the practice of using published works to train its system constitutes fair use, which provides protection for the use of copyrighted material to make a secondary work as long as it’s “transformative.”

OpenAI has said that it trains its model on “large, publicly available datasets that include copyrighted works.” Last year, it pivoted to no longer disclosing those materials in an attempt to maintain an advantage over competitors and sidestep legal liability. While it remains unknown which works were used, the authors pointed to ChatGPT generating summaries and in-depth analyses of the themes in their novels. They claimed that the company downloaded hundreds of thousands of books from shadow library sites to train its AI system.

Under the agreement, the training datasets will be made available at OpenAI’s San Francisco office on a secured computer without internet or network access. Anyone reviewing the information will be required to sign a non-disclosure agreement, sign a visitor’s log and provide identification.

Use of any kind of technology will be severely restricted. No recording devices, including computers, cell phones or cameras, will be allowed into the inspection room, per the joint stipulation. OpenAI may provide limited use of a computer to take notes, with lawyers for the authors copying those notes onto another device under the supervision of representatives for the company at the end of each day. No copies of any portion of the training data will be allowed.

“The Inspecting Party’s counsel and/or experts may take handwritten notes or electronic notes on the provided note-taking computer in scratch files, but may not copy any Training Data itself into any notes,” the filing states.

Lawyers at the Joseph Saveri Law Firm are spearheading the litigation. They also represent authors in identical copyright lawsuits against Meta. In those cases, fact discovery is slated to end on Sept. 30, though a request for an extension has been filed. U.S. District Judge Vince Chhabria at a hearing on Friday questioned whether the attorneys can adequately represent the writers.

“It’s very clear to me from the papers, from the docket and from talking to the magistrate judge that you have brought this case and you have not done your job to advance it,” Chhabria said, according to Politico. “You and your team have barely been litigating the case. That’s obvious… This is not your typical proposed class action. This is an important case. It’s an important societal issue. It’s important for your clients.”

The concern stemmed in part from the lawyers’ failure to conduct any depositions in the case.

“It is sometimes said that timing is everything. Well, it turns out that’s true for bad timing as well,” wrote U.S. Magistrate Judge Thomas Hixson. “Plaintiffs request that the Court allow them to take 35 party depositions, exclusive of third-party depositions, or in the alternative they request a total of 180 hours of deposition testimony. And they made that request … 18 days before the current close of fact discovery.”

The judge added, “Since Plaintiffs have taken zero depositions, the 35 party depositions (plus non-party depositions), or alternatively the 180 hours of deposition testimony, would all have to occur in the second half of September, which is obviously impossible.”
