The underlying dispute centers on whether Databricks and MosaicML infringed authors' copyrights by using their books to develop large language models (LLMs). The plaintiffs allege that Databricks copied their works from datasets known as The Pile and RedPajama-Books during experiments and tests conducted as part of developing MPT2, a next-generation LLM that was later renamed DBRX. They also point to statements by Databricks employees, including an executive's statement that Databricks used a dataset containing plaintiffs' works to "fine-tune models," as supporting the inference that their works were used in connection with DBRX.

Databricks and MosaicML moved to dismiss the direct infringement claim and to strike all allegations relating to DBRX, arguing that because plaintiffs did not allege copying in DBRX's final training dataset, the infringement allegations were too attenuated from the product itself. The defendants contended that research using a books dataset before DBRX was trained does not establish that the dataset was used in DBRX.

Judge Charles R. Breyer of the Northern District of California denied both motions. The court held that the complaint sufficiently tied the alleged copying to DBRX, noting that the plaintiffs described specific experiments — including pretraining with and without a books dataset to evaluate long-context results — and that the model from those experiments was later renamed DBRX. The court also noted that employee statements, read in context alongside more direct allegations, provided supporting inferences.

On the attenuation question, the court held that determining the degree of connection between the alleged copying and the DBRX product would require weighing evidence outside the pleadings, making dismissal inappropriate at this stage. The court acknowledged that defendants may ultimately prevail on the issue but concluded that plaintiffs' allegations were sufficient to proceed.

The case is In re Mosaic LLM Litigation, No. 3:24-cv-01451, in the Northern District of California. This ruling addresses the second amended consolidated complaint; the court had previously dismissed an earlier version of the DBRX direct infringement claim but noted that plaintiffs could move to amend if discovery revealed information supporting factual allegations for that claim.