FAIR USE

TRAINING ON SHADOW LIBRARIES
https://en.wikipedia.org/wiki/The_Pile_(dataset)
https://torrentfreak.com/u-s-court-order-against-annas-archive
https://torrentfreak.com/nvidia-annas-archive-millions-of-pirated-books
NVIDIA Contacted Anna’s Archive to Secure Access to Millions of Pirated Books
by Ernesto Van der Sar

“NVIDIA executives allegedly authorized the use of millions of pirated books from Anna’s Archive to fuel its AI training. In an expanded class-action lawsuit that cites internal NVIDIA documents, several book authors claim that the trillion-dollar company directly reached out to Anna’s Archive, seeking high-speed access to the shadow library data. Chip giant NVIDIA has been one of the main financial beneficiaries in the artificial intelligence boom. Revenue surged due to high demand for its AI-learning chips and data center services, and the end doesn’t appear to be in sight.

Besides selling the most sought-after hardware, NVIDIA is also developing its own models, including NeMo, Retro-48B, InstructRetro, and Megatron. These are trained using their own hardware and with help from large text libraries, much like other tech giants do. Like other tech companies, NVIDIA has also seen significant legal pushback from copyright holders in response to its training methods. This includes authors, who, in various lawsuits, accused tech companies of training their models on pirated books. In early 2024, for example, several authors sued NVIDIA over alleged copyright infringement.

Through the class action lawsuit, they claimed that the company’s AI models were trained on the Books3 dataset that included copyrighted works taken from the ‘pirate’ site Bibliotik. Since this happened without permission, the authors demanded compensation. In response, NVIDIA defended its actions as fair use, noting that books are nothing more than statistical correlations to its AI models. However, the allegations didn’t go away. On the contrary, the plaintiffs found more evidence during discovery.

Last Friday, the authors filed an amended complaint that significantly expands the scope of the lawsuit. In addition to adding more books, authors, and AI models, it also includes broader “shadow library” claims and allegations. The authors, including Abdi Nazemian, now cite various internal Nvidia emails and documents, suggesting that the company willingly downloaded millions of copyrighted books. The new complaint alleges that “competitive pressures drove NVIDIA to piracy”, which allegedly included collaborating with the controversial Anna’s Archive library.

According to the amended complaint, a member of Nvidia’s data strategy team reached out to Anna’s Archive to find out what the pirate library could offer the trillion-dollar company. “Desperate for books, NVIDIA contacted Anna’s Archive—the largest and most brazen of the remaining shadow libraries—about acquiring its millions of pirated materials and ‘including Anna’s Archive in pre-training data for our LLMs’,” the complaint notes. “Because Anna’s Archive charged tens of thousands of dollars for ‘high-speed access’ to its pirated collections […] NVIDIA sought to find out what ‘high-speed access’ to the data would look like.”

According to the complaint, Anna’s Archive then warned Nvidia that its library was illegally acquired and maintained. Because the site previously wasted time on other AI companies, the pirate library asked NVIDIA executives if they had internal permission to move forward. This permission was allegedly granted within a week, after which Anna’s Archive provided the chip giant with access to its pirated books. “Within a week of contacting Anna’s Archive, and days after being warned by Anna’s Archive of the illegal nature of their collections, NVIDIA management gave ‘the green light’ to proceed with the piracy. Anna’s Archive offered NVIDIA millions of pirated copyrighted books.”

The complaint states that Anna’s Archive promised to provide NVIDIA with access to roughly 500 terabytes of data. This included millions of books that are usually only accessible through Internet Archive’s digital lending system, which itself has been targeted in court. The complaint does not explicitly mention whether NVIDIA ended up paying Anna’s Archive for access to the data. Additionally, it’s worth mentioning that NVIDIA also stands accused of using other pirated sources. In addition to the previously included Books3 database, the new complaint also alleges that the company downloaded books from LibGen, Sci-Hub, and Z-Library.

In addition to downloading and using pirated books for its own AI training, the authors allege NVIDIA distributed scripts and tools that allowed its corporate customers to automatically download “The Pile”, which contains the Books3 pirated dataset. These allegations lead to new claims of vicarious and contributory infringement, alleging that NVIDIA generated revenue from customers by facilitating access to these pirated datasets. Based on these and other claims, the authors request to be compensated for the damages they suffered. This applies to the named authors, but also to potentially hundreds of others who may later join the class action lawsuit. As far as we know, this is the first time that correspondence between a major U.S. tech company and Anna’s Archive was revealed in public. This will only raise the profile of the pirate library, which just lost several domain names, even further.”

“A copy of the first consolidated and amended complaint, filed at the U.S. District Court for the Northern District of California, is available here (pdf). The named authors include Abdi Nazemian, Brian Keene, Stewart O’Nan, Andre Dubus III, and Susan Orlean.”

ROBOT BOOK CRITICS DID NOT ENJOY YOUR STATISTICAL CORRELATIONS
https://cnet.com/meta-won-fair-use-lawsuit-but-judge-says-authors-likely-to-win
https://reuters.com/copyright-battles-enter-pivotal-year-us-courts-weigh-fair-use
https://torrentfreak.com/copyrighted-books-are-statistical-correlations-to-ai-models
NVIDIA: Copyrighted Books Are Just Statistical Correlations to Our AI Models
by Ernesto Van der Sar  /  August 17, 2024

“NVIDIA sits front and center of the AI boom. The company provides the much-needed chips and offers its own AI models. NVIDIA admittedly used pirated books to train these models, which triggered a copyright infringement lawsuit. This week, the company informed the court that these claims fall flat, arguing that copyrighted books are nothing more than statistical correlations to its AI models. Over the past two years, AI developments have progressed at a rapid pace. This includes large language models, which are typically trained on broad datasets of texts; the more, the better. When AI hit the mainstream, it became apparent that rightsholders are not always pleased that their works were used to train AI.

This applies to photographers, artists, music companies, journalists, and authors, some of whom formed groups to file copyright infringement lawsuits to protect their rights. Book authors, in particular, complained about the use of pirated books as training material. In various lawsuits, companies including OpenAI, Microsoft, Meta, and NVIDIA are accused of using the ‘Books3’ dataset, which was scraped from the library of ‘pirate’ site Bibliotik. After the Books3 accusations hit mainstream news, many AI companies stopped using this source. Meanwhile, anti-piracy companies helped publishers to take the alleged rogue libraries offline to prevent further damage. These enforcement efforts aren’t limited to Books3 either, or the English language for that matter; earlier this week anti-piracy group BREIN reported that it helped to remove a Dutch language dataset.

Earlier this year, several authors sued NVIDIA over alleged copyright infringement. The class action lawsuit alleged that the company’s AI models were trained on copyrighted works and specifically mentioned Books3 data. Since this happened without permission, the rightsholders demand compensation. The lawsuit was followed up by a near-identical case a few weeks later, and NVIDIA plans to challenge both in court by denying the copyright infringement allegations. In its initial response, filed a few weeks ago, NVIDIA did not deny that it used the Books3 dataset. Like many other AI companies, it believes that the use of copyrighted data for AI training is a prime example of fair use; especially when the output of the model doesn’t reproduce copyrighted works.

The authors clearly have a different take. They allege that NVIDIA willingly copied an archive of pirated books to train its commercial AI model, and are demanding damages for direct copyright infringement. This week, the authors and NVIDIA filed a joint case management statement at a California court, laying out a preliminary timeline. This shows that both parties intend to take their time to properly litigate the matter. The authors expect that the parties need until October next year to gather facts and evidence during the discovery phase. An eventual jury trial is penciled in a full year later, November 2026.

NVIDIA doesn’t have a hard trial deadline in mind but stresses that the fair use issue is key, and should be addressed early and efficiently. For starters, the company intends to file a motion for summary judgment within a year, after which both parties should have more clarity. Aside from the timeline, NVIDIA also shared its early outlook on the case. The company believes that AI companies should be allowed to use copyrighted books to train their AI models, as these books are made up of “uncopyrightable facts and ideas” that are already in the public domain. The argument may seem surprising at first; the authors own copyrights and as far as they’re concerned, use of pirated copies leads to liability as a direct infringer.

However, NVIDIA goes on to explain that their AI models don’t see these works that way. AI training doesn’t involve any book reading skills, or even a basic understanding of a storyline. Instead, it simply measures statistical correlations and adds these to the model. “Training measures statistical correlations in the aggregate, across a vast body of data, and encodes them into the parameters of a model. Plaintiffs do not try to claim a copyright over those statistical correlations, asserting instead that the training data itself is ‘copied’ for the purposes of infringement,” NVIDIA writes.

Put differently, NVIDIA argues that its AI models don’t use the books the way humans do; neither do they reproduce them. It’s simply examining the ‘facts and ideas’ in the books, ‘transforming’ their original purpose to build a complex AI model. That qualifies as fair use, they state. “Plaintiffs cannot use copyright to preclude access to facts and ideas, and the highly transformative training process is protected entirely by the well-established fair-use doctrine.” “Indeed, to accept Plaintiffs’ theory would mean that an author could copyright the rules of grammar or basic facts about the world. That has never been the law, for good reason,” NVIDIA adds.

According to NVIDIA, the lawsuit boils down to two related questions. First, whether the authors’ direct infringement claim is essentially an attempt to claim copyright on facts and grammar. Second, whether making copies of the books is fair use. The chip company believes that it didn’t do anything wrong and cites several cases that will likely appear in its future filings. They include the Authors Guild v. Google lawsuit, where the court of appeals concluded that copying books to create a searchable database was fair use. As a result, Google Books still exists today.

NVIDIA is not the only company that will rely on a fair use defense in response to AI-related copyright infringement claims. Many other companies are taking the same approach so whether it succeeds will prove key for the future of AI model development. What makes these matters more complex is that AI models and technologies have different applications; so what may be fair use in one case, could be copyright infringing in another. For example, earlier this week, a California federal court ruled that a copyright lawsuit filed by visual artists against DeviantArt, Midjourney, Runway AI, and Stability AI, can move forward.

These defendants are also accused of copyright infringement, but the lawsuit deals with images, and image outputs instead. Given the parties involved and the potential damages at stake, these lawsuits will keep the courts busy for years to come. Even after the first ‘final’ verdicts come in, there will be appeals, and some questions may eventually end up at the Supreme Court. Meanwhile, the actions of NVIDIA and other AI companies will be closely monitored by copyright watchers. This includes recent press reports accusing NVIDIA, among others, of scraping both videos and transcripts from YouTube, to train their respective models.”

“The joint case management statement in Nazemian vs. Nvidia is available here (pdf)”
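
For readers wondering what “measures statistical correlations in the aggregate ... and encodes them into the parameters of a model” can look like in the simplest possible case, here is a toy sketch. It is a word-pair (bigram) counter, not a description of how NVIDIA's or anyone else's LLMs are actually trained; the function name train_bigram and the two-sentence corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(texts):
    """Toy 'training': count which word follows which across all texts,
    then normalize the counts into conditional probabilities.
    The returned nested dict plays the role of the model's 'parameters'."""
    counts = defaultdict(Counter)
    for text in texts:
        words = text.lower().split()
        # Pair every word with its successor and tally the pair.
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1

    params = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        params[prev] = {word: n / total for word, n in followers.items()}
    return params


# Hypothetical two-sentence corpus, standing in for "a vast body of data".
corpus = ["the cat sat on the mat", "the cat ate the fish"]
params = train_bigram(corpus)

# Probability of each word following "the", aggregated over both sentences.
print(params["the"])
```

Note that the toy function still has to read a full copy of every input text before it can count anything, which is the step the plaintiffs' direct infringement claim is aimed at; the sketch takes no position on the fair use question itself.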

PREVIOUSLY

MACHINE READABLE
https://spectrevision.net/2024/04/25/machine-readable/
DATA HOARDING
https://spectrevision.net/2019/12/18/data-hoarding/
GUERRILLA OPEN ACCESS
https://spectrevision.net/2016/02/18/guerrilla-open-access/